You've probably seen this happen already. Two annotators review the same support ticket, image, or transcript and come back with different labels. One says the customer is frustrated. The other says neutral. One tags an object as damaged. The other marks it acceptable. If you're training a model on that dataset, the disagreement isn't a side issue. It's the dataset telling you the task, the guidelines, or the reviewer setup still has gaps.
That's where inter-rater reliability calculation becomes operational, not academic. You're not calculating a score just to satisfy a QA checklist. You're measuring whether your labeling process is stable enough to support model training, downstream analytics, or research conclusions. The useful question isn't “what's the best IRR metric in general?” It's “what metric matches my data, my annotators, and the way this project runs?”
A lot of junior teams get stuck because they jump straight to Cohen's kappa or Fleiss' kappa without first checking the shape of the problem. Is the label categorical or continuous? Are there two raters or many? Are all items rated by everyone? Is missing data normal in the workflow? Those choices matter more than the formula itself.
Why Consistent Data Annotation is Non-Negotiable
When annotation drift shows up, model quality usually degrades before anyone notices why. The training pipeline runs. Metrics arrive. People debate architecture changes. Meanwhile, the root issue is often simpler: the labels aren't consistent enough to teach the model a stable pattern.

That risk goes beyond classic NLP labeling. Teams doing document parsing, invoice extraction, multilingual transcription, and computer vision all hit the same wall. If your pipeline starts with noisy source material, even strong preprocessing won't save weak human judgment rules. That's one reason teams working on OCR and structured document workflows often spend time tightening upstream ingestion and extraction standards before they optimize models. A useful example is DigiParser's data extraction, where the data capture layer itself affects how much ambiguity reaches annotators.
Agreement is not the same as reliability
A beginner mistake is to use raw agreement alone and assume the problem is solved. If two raters agree most of the time, that sounds reassuring. But agreement by itself can hide weak labeling design, especially when one class dominates and both annotators mostly choose it.
A methodological review notes that percent agreement is the simplest approach but can overestimate reliability because it ignores chance agreement, while kappa-type statistics adjust for chance. The same review recommends reporting both percent agreement and kappa instead of treating one number as universally sufficient, especially when agreement metrics can shift under class imbalance or prevalence effects (methodological review on agreement measures).
Practical rule: If a label is rare, raw agreement can look comforting while your positive-class consistency is actually shaky.
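To make that rule concrete, here is a minimal sketch with made-up counts for a hypothetical rare label. The helper name `agreement_summary` and the numbers are assumptions for illustration. It compares raw agreement with positive-class (specific) agreement, computed as 2a / (2a + b + c), where a is the number of items both raters marked positive and b and c are the items only one of them marked positive.

```python
def agreement_summary(both_pos, only_a_pos, only_b_pos, both_neg):
    """Return raw agreement and positive-class agreement for two raters."""
    total = both_pos + only_a_pos + only_b_pos + both_neg
    raw = (both_pos + both_neg) / total
    # Positive (specific) agreement: of all positive calls made by either
    # rater, the share that the other rater matched.
    positive = 2 * both_pos / (2 * both_pos + only_a_pos + only_b_pos)
    return raw, positive


# Hypothetical batch of 100 items where the positive label is rare.
raw, positive = agreement_summary(both_pos=2, only_a_pos=4, only_b_pos=4, both_neg=90)
print(f"raw agreement: {raw:.2f}")            # 0.92
print(f"positive agreement: {positive:.2f}")  # 0.33
```

In this invented batch the raters agree on 92% of items overall, yet they match on only about a third of the positive calls.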
Why this changes how teams manage annotation
Inter-rater reliability calculation is really a management tool for label quality. It tells you whether disagreement is random, systematic, or built into the task design. That matters whether you're labeling sentiment, named entities, product defects, speech segments, or medical text.
Use IRR early, not after the full project is complete. A pilot batch exposes ambiguity faster than a postmortem on a finished dataset. Teams that treat reliability as an intake check usually spend less time relabeling and less time arguing over “bad model behavior” that started with inconsistent human decisions.
If you need a quick refresher on the broader workflow before diving into metrics, this overview of data annotation fundamentals is a useful grounding point.
Choosing the Right Inter Rater Reliability Metric
The metric has to match the annotation setup. If it doesn't, the output may still look official, but it won't answer the actual question. I've seen teams use Cohen's kappa on workflows that weren't really pairwise, and I've seen others force Fleiss' kappa onto incomplete rating grids where missing labels were normal. In both cases, the number looked precise and the interpretation was wrong.
Start with the shape of the problem
Pick the metric by answering four questions:
- How many raters are involved
- What kind of data are they producing
- Do you need to account for chance agreement
- Are missing labels part of the workflow
Many guides fall short by naming common statistics but not cleanly mapping them to binary, ordinal, continuous, or multi-rater annotation setups. That gap matters in production because the wrong metric can make a label set look stronger or weaker than it really is.
Inter-Rater Reliability Metrics Compared
| Metric | Number of Raters | Data Type | Handles Chance | Handles Missing Data |
|---|---|---|---|---|
| Percent Agreement | Two or more | Best as a simple categorical summary | No | No, not gracefully |
| Cohen's Kappa | Two | Categorical, commonly nominal | Yes | Generally expects paired ratings |
| Fleiss' Kappa | Three or more | Categorical | Yes | Not a natural fit for incomplete rating patterns |
| Krippendorff's Alpha | Two or more | Nominal, ordinal, interval, ratio | Yes | Yes |
What works well and what tends to fail
Percent agreement works as a first-pass descriptive check. It's easy to explain to stakeholders and easy to compute in a spreadsheet. It fails when people treat it as a full reliability verdict.
Cohen's kappa works well when exactly two raters evaluate the same items using the same category set. It's often the cleanest choice for pilot labeling rounds with paired reviewers. It fails when the workflow isn't pairwise or when teams ignore prevalence effects and assume the score reflects pure annotation quality.
Fleiss' kappa works when multiple raters assess items in a structured categorical setup. It's useful for pooled annotation teams and adjudication pilots. It gets awkward when the project has skipped items, specialist raters, or staggered review patterns.
Krippendorff's alpha is the flexible choice for messy operations. It's usually the right answer when data types vary, rating participation is incomplete, or your project has a mix of nominal and ordered judgments. It's less intuitive for beginners, but in real operations it often reflects reality better than the cleaner textbook choices.
Don't ask “Which metric is best?” Ask “Which metric fits the annotation mechanics I actually have?”
A simple selection guide
Use this decision shortcut in practice:
- Two raters, same categorical labels on every item: Start with Cohen's kappa
- Three or more raters, categorical labels, structured assignment: Use Fleiss' kappa
- Ordinal, interval, continuous, or incomplete data: Use Krippendorff's alpha
- Need a plain-language summary for non-technical stakeholders: Report percent agreement alongside the main reliability metric
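The shortcut above can be sketched as a small helper. The function name `suggest_metric` and its parameters are assumptions, not part of any library. It treats "categorical" as nominal labels and uses `has_missing=False` as a stand-in for "structured assignment", so treat it as a starting point rather than a rule.

```python
def suggest_metric(n_raters, data_type, has_missing=False):
    """Map an annotation setup to a starting metric, following the guide above.

    data_type is one of: "nominal", "ordinal", "interval", "ratio".
    """
    if n_raters < 2:
        raise ValueError("Agreement needs at least two raters")
    # Ordered or numeric labels, or incomplete grids, point to alpha.
    if has_missing or data_type != "nominal":
        return "Krippendorff's alpha"
    if n_raters == 2:
        return "Cohen's kappa"
    return "Fleiss' kappa"


print(suggest_metric(2, "nominal"))                    # Cohen's kappa
print(suggest_metric(5, "nominal"))                    # Fleiss' kappa
print(suggest_metric(3, "ordinal"))                    # Krippendorff's alpha
print(suggest_metric(4, "nominal", has_missing=True))  # Krippendorff's alpha
```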
One subtle point matters a lot in mature annotation programs. Reliability metrics tell you how consistently people apply a scheme. They do not automatically tell you whether the scheme itself makes sense. A team can apply a flawed rule very consistently.
Calculating Cohen's Kappa for Two Raters
Cohen's kappa is usually the first serious inter-rater reliability calculation teams implement because many annotation pilots start with two reviewers on the same sample set. The setup is straightforward: both raters label the same items, and you want agreement adjusted for what could happen by chance.

A worked example with binary labels
Suppose two annotators label a batch of images as either cat or no cat. Put the results into a confusion matrix:
| | Rater B cat | Rater B no cat |
|---|---|---|
| Rater A cat | a | b |
| Rater A no cat | c | d |
From that matrix:
- Observed agreement is the proportion of items where both raters chose the same label. That's the diagonal cells over the total.
- Expected agreement is the agreement you'd expect from the raters' marginal label tendencies alone.
- Cohen's kappa is calculated as:
kappa = (observed agreement - expected agreement) / (1 - expected agreement)
The reason this works well in practice is that it penalizes easy agreement. If both annotators call almost everything “no cat,” percent agreement may look high even if they're not reliably identifying actual cats. Kappa forces you to confront that.
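As a sanity check on the formula, here is a minimal sketch that computes kappa directly from the four cells a, b, c, d of the matrix above. The function name `cohen_kappa_from_cells` and the counts are assumptions chosen to mimic a batch where both raters call almost everything "no cat".

```python
def cohen_kappa_from_cells(a, b, c, d):
    """Cohen's kappa from a 2x2 table laid out as in the matrix above.

    a: both say cat, b: A cat / B no cat, c: A no cat / B cat, d: both no cat.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from each rater's marginal rate of saying "cat".
    p_a_cat = (a + b) / n
    p_b_cat = (a + c) / n
    expected = p_a_cat * p_b_cat + (1 - p_a_cat) * (1 - p_b_cat)
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# Hypothetical batch of 100 images dominated by "no cat".
observed, expected, kappa = cohen_kappa_from_cells(a=1, b=4, c=4, d=91)
print(f"observed={observed:.3f} expected={expected:.3f} kappa={kappa:.3f}")
# observed=0.920 expected=0.905 kappa=0.158
```

Raw agreement is 92%, but almost all of it is what the marginals alone would produce, so kappa lands near 0.16. The sketch does not guard against the degenerate case where expected agreement equals 1.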
Python example with scikit-learn
For day-to-day work, don't hand-calculate this beyond a small sanity check. Use code, and keep the labeling vectors versioned with the annotation batch.
from sklearn.metrics import cohen_kappa_score, confusion_matrix
rater_a = ["cat", "cat", "no_cat", "cat", "no_cat", "no_cat"]
rater_b = ["cat", "no_cat", "no_cat", "cat", "no_cat", "cat"]
kappa = cohen_kappa_score(rater_a, rater_b)
cm = confusion_matrix(rater_a, rater_b, labels=["cat", "no_cat"])
print("Cohen's kappa:", kappa)
print("Confusion matrix:\n", cm)
This should be part of the same notebook or evaluation script where you inspect disagreement examples. Never stop at the coefficient alone. Pull the mismatched records and read them.
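One way to pull the mismatched records, sketched in plain Python. The `item_ids` values are hypothetical; in practice they would be whatever key lets a reviewer open the original record.

```python
item_ids = ["img_01", "img_02", "img_03", "img_04", "img_05", "img_06"]
rater_a = ["cat", "cat", "no_cat", "cat", "no_cat", "no_cat"]
rater_b = ["cat", "no_cat", "no_cat", "cat", "no_cat", "cat"]

# Keep the item id with each disagreement so a reviewer can open the record.
mismatches = [
    {"item_id": item, "rater_a": a, "rater_b": b}
    for item, a, b in zip(item_ids, rater_a, rater_b)
    if a != b
]

for row in mismatches:
    print(row)
```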
R example with irr
If your team works in R, the irr package handles this cleanly.
library(irr)
ratings <- data.frame(
rater_a = c("cat", "cat", "no_cat", "cat", "no_cat", "no_cat"),
rater_b = c("cat", "no_cat", "no_cat", "cat", "no_cat", "cat")
)
kappa2(ratings)
How to interpret the result
Interpretation is where junior analysts often overstate confidence. A kappa score isn't a badge of success by itself. It's evidence about consistency under a specific setup, with a specific sample, under a specific guideline version.
Use Cohen's kappa when:
- The task is pairwise and every item has both ratings
- Categories are clearly defined, such as binary defect flags or sentiment bins
- You need a defensible baseline before scaling to a larger annotation pool
Don't use it when your workflow already involves rotating labelers, skipped items, or ordered ratings that deserve a more flexible treatment. In those cases, forcing pairwise kappa usually creates cleanup work later.
If your team is building supervised datasets for production models, this operational view of machine learning data labeling helps connect the reliability check to the rest of the training pipeline.
The confusion matrix usually tells you more than the summary coefficient. If all the disagreement sits in one class boundary, the fix is probably in the guidelines, not the raters.
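A quick way to see whether disagreement clusters on one class boundary is to count mismatches per label pair. This is a minimal sketch with invented sentiment labels, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical sentiment labels from two raters on the same ten tickets.
rater_a = ["neg", "neu", "neu", "pos", "neg", "neu", "pos", "neg", "neu", "neg"]
rater_b = ["neg", "neg", "neu", "pos", "neu", "neg", "pos", "neg", "neg", "neg"]

# Count disagreements per unordered label pair, ignoring who chose which.
boundary_counts = Counter(
    tuple(sorted((a, b))) for a, b in zip(rater_a, rater_b) if a != b
)

for pair, count in boundary_counts.most_common():
    print(pair, count)  # ('neg', 'neu') 4
```

Here every disagreement sits on the neutral/negative boundary, which points at the definition of those two labels rather than at either rater.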
Scaling to Multiple Raters with Fleiss' Kappa
Once more than two annotators are involved, pairwise Cohen's kappa stops being enough. You can still compute pairwise scores, but that quickly becomes messy and hard to summarize. Fleiss' kappa gives you a group-level view for categorical annotation.
Where Fleiss' kappa fits
Use it when you have multiple raters assigning categorical labels and the evaluation setup is reasonably structured. A classic example is a review batch where several annotators classify the same support tickets as positive, neutral, or negative.
The data format is different from the pairwise setup. Instead of one row per item with one column per specific rater label, Fleiss' kappa often uses counts by category for each item.
| Ticket | Positive | Neutral | Negative |
|---|---|---|---|
| 1 | 3 | 1 | 1 |
| 2 | 0 | 4 | 1 |
| 3 | 1 | 1 | 3 |
Each row tells you how many raters chose each class for that item.
What the statistic is actually doing
Fleiss' kappa compares observed agreement across the panel with the agreement you'd expect if category assignments followed the overall distribution by chance. That gives you a single summary of how aligned the group is.
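To show what that comparison looks like in arithmetic, here is a minimal hand-rolled sketch. The function name `fleiss_kappa_from_counts` is an assumption; for real work a maintained library implementation is the safer choice. It assumes every item received the same number of ratings.

```python
def fleiss_kappa_from_counts(counts):
    """Fleiss' kappa for an items x categories count matrix.

    Assumes every item was rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("Every item needs the same number of ratings")

    # Per-item agreement: share of rater pairs that agree on that item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category distribution.
    total = n_items * n_raters
    shares = [sum(col) / total for col in zip(*counts)]
    expected = sum(p * p for p in shares)

    return (observed - expected) / (1 - expected)


# Four items, five raters each, three categories.
ratings = [[3, 1, 1], [0, 4, 1], [1, 1, 3], [5, 0, 0]]
print(round(fleiss_kappa_from_counts(ratings), 4))  # 0.3023
```

For this matrix, observed agreement is 0.55 and chance agreement is 0.355, which gives a kappa of roughly 0.30.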
The trade-off is important. Fleiss' kappa is helpful when raters are treated as part of a pool rather than as fixed, individually meaningful judges. If your workflow depends heavily on specialist reviewers who are not interchangeable, this can be a bad fit. The number may still compute, but the assumption underneath it won't match the operation.
Python example
statsmodels supports Fleiss' kappa using a category-count matrix.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa
# Rows are items, columns are category counts
ratings = np.array([
[3, 1, 1],
[0, 4, 1],
[1, 1, 3],
[5, 0, 0]
])
kappa = fleiss_kappa(ratings)
print("Fleiss' kappa:", kappa)
If your raw data is in long format, write a preprocessing step that groups by item and category, then pivots into this matrix. That preprocessing step is where many bugs happen, so test it carefully.
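A minimal sketch of that preprocessing step in plain Python, which keeps the logic easy to unit test. The function name `to_count_matrix` and the record layout `(item_id, rater_id, label)` are assumptions about how the raw export looks.

```python
from collections import Counter, defaultdict


def to_count_matrix(records, categories):
    """Pivot (item_id, rater_id, label) rows into an items x categories matrix."""
    per_item = defaultdict(Counter)
    for item_id, _rater_id, label in records:
        # Fail loudly on typos or stray labels instead of silently dropping them.
        if label not in categories:
            raise ValueError(f"Unexpected label {label!r} on item {item_id!r}")
        per_item[item_id][label] += 1
    item_ids = sorted(per_item)
    matrix = [[per_item[i][c] for c in categories] for i in item_ids]
    return item_ids, matrix


records = [
    (1, "A", "positive"), (1, "B", "positive"), (1, "C", "neutral"),
    (2, "A", "negative"), (2, "B", "neutral"), (2, "C", "neutral"),
]
item_ids, matrix = to_count_matrix(records, ["positive", "neutral", "negative"])
print(item_ids)  # [1, 2]
print(matrix)    # [[2, 1, 0], [0, 2, 1]]
```

Fixing the category order in one explicit list avoids a common bug where columns shift because a category never appears in a given batch.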
R example
The irr package is again a common choice.
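library(irr)
# kappam.fleiss() expects raw labels: one row per item, one column per rater.
# These rows encode the same votes as the count matrix in the Python example
# (five raters per item).
ratings <- matrix(c(
  "pos", "pos", "pos", "neu", "neg",
  "neu", "neu", "neu", "neu", "neg",
  "pos", "neu", "neg", "neg", "neg",
  "pos", "pos", "pos", "pos", "pos"
), ncol = 5, byrow = TRUE)
kappam.fleiss(ratings)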
Common mistakes with multi-rater IRR
Teams usually run into trouble in one of these ways:
- Uneven assignment patterns: Not every rater reviewed every item, but the analysis assumes they did.
- Specialist raters: A legal annotator, a clinician, and a generalist are treated like interchangeable reviewers.
- Adjudicated labels mixed in: Final corrected labels get blended into the raw agreement dataset.
- Category collapse after the fact: Teams merge labels after annotation to improve agreement, then report the score as if that was the original task.
A cleaner habit is to calculate Fleiss' kappa on a calibration batch before the full run. That tells you whether the category system is stable enough to scale.
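Two of the mistakes above, uneven assignment and adjudicated labels mixed into the raw data, can be caught with a pre-flight check before any score is computed. This is a minimal sketch: the function name `preflight_issues` and the `source` field are assumptions about how your export marks raw versus adjudicated rows.

```python
from collections import defaultdict


def preflight_issues(records, expected_raters):
    """Return human-readable problems found in long-format rating rows.

    Each record is a dict with item_id, rater_id, label, and source keys.
    """
    issues = []
    raters_per_item = defaultdict(set)
    for row in records:
        # Adjudicated or corrected labels should not count as raw ratings.
        if row["source"] != "raw":
            issues.append(f"item {row['item_id']}: non-raw row from {row['source']}")
            continue
        raters_per_item[row["item_id"]].add(row["rater_id"])
    for item_id, raters in sorted(raters_per_item.items()):
        if len(raters) != expected_raters:
            issues.append(f"item {item_id}: {len(raters)} of {expected_raters} raters")
    return issues


records = [
    {"item_id": 1, "rater_id": "A", "label": "neutral", "source": "raw"},
    {"item_id": 1, "rater_id": "B", "label": "neutral", "source": "raw"},
    {"item_id": 2, "rater_id": "A", "label": "negative", "source": "raw"},
    {"item_id": 2, "rater_id": "lead", "label": "neutral", "source": "adjudication"},
]
for issue in preflight_issues(records, expected_raters=2):
    print(issue)
```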
Using Krippendorff's Alpha for Maximum Flexibility
Real annotation programs rarely stay clean for long. A reviewer skips an item. A specialist only labels a subset. The task moves from simple categories to ordered severity levels. A score becomes continuous instead of nominal. In such cases, Krippendorff's alpha usually earns its place.

Why teams reach for alpha
Krippendorff's alpha is the most flexible option in the standard toolkit because it works with:
- Any number of raters
- Different measurement levels, including nominal, ordinal, interval, and ratio
- Missing data, without forcing you to throw out entire items
That makes it a strong fit for actual production workflows, not just textbook annotation exercises.
When alpha is the better choice
Use alpha when the annotation job includes any of the following:
- A moderation queue where some items are escalated to senior reviewers only
- Ordinal labels such as low, medium, high severity
- Scored judgments such as quality ratings or relevance scores
- Incomplete review grids caused by skip logic or workload balancing
This flexibility matters because annotation operations are often iterative. Teams refine guidelines, reroute hard cases, and involve non-interchangeable reviewers. A qualitative methods resource outlines collaborative and iterative approaches to IRR and explicitly cautions that quantitative reliability should be used alongside discussion, memoing, and documentation rather than as a standalone verdict (Dedoose overview of inter-rater reliability workflows).
If missing labels are part of the real workflow, don't “clean” them away just to use a simpler metric.
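To show how alpha copes with an incomplete grid, here is a minimal sketch of the nominal version built from a coincidence matrix. The function name `krippendorff_alpha_nominal` and the record layout are assumptions, and it covers only unordered labels. Ordinal or interval data needs a different distance function, so use a maintained library for those cases.

```python
from collections import Counter, defaultdict
from itertools import permutations


def krippendorff_alpha_nominal(records):
    """Nominal Krippendorff's alpha from (item_id, rater_id, label) rows.

    Missing ratings are simply absent from the input.
    """
    labels_by_item = defaultdict(list)
    for item_id, _rater_id, label in records:
        labels_by_item[item_id].append(label)

    # Coincidence matrix: each ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings on that item.
    coincidences = Counter()
    for labels in labels_by_item.values():
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        for first, second in permutations(labels, 2):
            coincidences[(first, second)] += 1 / (m - 1)

    totals = Counter()
    for (first, _second), weight in coincidences.items():
        totals[first] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("Alpha is undefined when only one label value occurs")
    return 1 - observed / expected


# Item 1 and item 3 each have two ratings, item 2 has three.
records = [
    (1, "A", "low"), (1, "B", "medium"),
    (2, "A", "high"), (2, "B", "high"), (2, "C", "medium"),
    (3, "A", "low"), (3, "C", "low"),
]
print(round(krippendorff_alpha_nominal(records), 4))  # 0.25
```

No item had to be discarded even though no rater labeled everything. Each item contributes according to how many ratings it actually received.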
Python and R examples
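In Python, many teams use simpledorff for Krippendorff's alpha with long-format data. Its default distance metric is nominal, so the example below treats low, medium, and high as unordered labels. Check the package's options if the order should count.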
import pandas as pd
import simpledorff
df = pd.DataFrame({
"item_id": [1,1,2,2,2,3,3],
"rater_id": ["A","B","A","B","C","A","C"],
"rating": ["low","medium","high","high","medium","low","low"]
})
alpha = simpledorff.calculate_krippendorffs_alpha_for_df(
df,
experiment_col="item_id",
annotator_col="rater_id",
class_col="rating"
)
print("Krippendorff's alpha:", alpha)
In R, packages vary by implementation, but the workflow is similar: prepare a matrix or long-format structure, specify the measurement level correctly, then compute alpha.
library(irr)
# kripp.alpha() expects raters in rows and items in columns.
# Ordinal labels are coded as integers: low = 1, medium = 2, high = 3.
ratings <- rbind(
  rater_a = c(1, 3, 1),
  rater_b = c(2, 3, NA),
  rater_c = c(NA, 2, 1)
)
# Specify the measurement level that matches your labels.
kripp.alpha(ratings, method = "ordinal")
If your ML pipeline includes complex supervised data preparation, this guide to data for training AI systems is a useful companion to the reliability side of the process.
Beyond the Score: Acting on Your IRR Results
The most expensive mistake in inter-rater reliability calculation is treating the score as the finish line. It isn't. It's a diagnostic readout. The score tells you where to investigate, not whether the dataset is automatically safe to use.

What a low score usually means
Low agreement can come from at least three different causes:
- The guidelines are vague. Annotators are making good-faith decisions with different interpretations.
- The task is inherently subjective. Borderline items genuinely support more than one reading.
- The rater pool is misaligned. Reviewers have different training, domain knowledge, or threshold habits.
That distinction matters because the response should differ. Inter-rater reliability indices measure consistency, not whether the coding scheme itself is well defined. In practice, teams need to separate disagreement caused by unclear guidelines from disagreement caused by true subjectivity. Qualitative-methods guidance recommends using IRR alongside discussion and documentation rather than as a standalone verdict, and that applies especially to multilingual transcription, image annotation, and domain-specific labeling.
A practical response loop
When I review annotation quality with a team, I don't start by asking whether the score is “good.” I ask what the disagreement rows have in common.
Use this loop:
1. Pull the disagreement set. Review the exact items that produced conflict. Group them by pattern, not by annotator.
2. Tag the reason for disagreement. Was the issue missing guideline coverage, unclear boundary definitions, low input quality, or real ambiguity?
3. Revise the instructions. Add examples, counterexamples, and tie-break rules. A one-line label definition usually isn't enough for edge cases.
4. Run a fresh calibration batch. Re-test on a targeted sample before restarting large-scale labeling.
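The tagging step is easier to act on once the tags are counted. A minimal sketch, assuming each disagreement item has been given one hypothetical reason tag during review:

```python
from collections import Counter

# Hypothetical review notes: one reason tag per disagreement item.
disagreements = [
    {"item_id": "t-101", "reason": "unclear_boundary"},
    {"item_id": "t-117", "reason": "missing_guideline"},
    {"item_id": "t-142", "reason": "unclear_boundary"},
    {"item_id": "t-150", "reason": "low_input_quality"},
    {"item_id": "t-163", "reason": "unclear_boundary"},
]

reason_counts = Counter(row["reason"] for row in disagreements)
top_reason, top_count = reason_counts.most_common(1)[0]
print(f"Most common cause: {top_reason} ({top_count} of {len(disagreements)})")
# Most common cause: unclear_boundary (3 of 5)
```

If one tag dominates, that is usually where the next guideline revision should start.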
High consistency with bad rules still produces bad training data.
What to document every time
A mature annotation program keeps a lightweight audit trail. At minimum, document:
- Guideline version used for the IRR batch
- Who rated the sample, including specialist vs generalist roles
- What metric was used and why it matched the task
- Which disagreements triggered changes
- What changed before the next round
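One lightweight way to keep that record is a small structured entry per IRR batch. This is a sketch, not a required schema: the class name `IrrBatchRecord`, the field names, and the example values (including the score) are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IrrBatchRecord:
    guideline_version: str
    raters: dict  # rater id -> role, e.g. "specialist" or "generalist"
    metric: str
    metric_rationale: str
    score: float
    disagreements_that_triggered_changes: list = field(default_factory=list)
    changes_before_next_round: list = field(default_factory=list)


# Hypothetical entry for one calibration batch.
record = IrrBatchRecord(
    guideline_version="v1.3",
    raters={"A": "generalist", "B": "specialist"},
    metric="Cohen's kappa",
    metric_rationale="Two raters, nominal labels, every item double-rated",
    score=0.61,
    disagreements_that_triggered_changes=["neutral vs negative on refund tickets"],
    changes_before_next_round=["Added tie-break rule for refund complaints"],
)

# Serialize so the entry can sit next to the annotation batch in version control.
print(json.dumps(asdict(record), indent=2))
```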
That record matters because reliability is often iterative. The workflow changes, the raters change, and the task definition sharpens over time. If you don't log those changes, the score history becomes hard to interpret.
The teams that get this right use IRR as part of a recurring quality loop. They calculate it on calibration samples, inspect disagreements, revise the playbook, retrain reviewers, and only then scale. That's how the metric becomes useful to model performance instead of just satisfying reporting.
