
Human Evaluation Frameworks for LLM Systems

Every evaluation framework eventually asks the same question: but what do humans think? Automated metrics are cheap and fast, and they are often insufficient for the dimensions that matter most to a downstream user. Human evaluation is irreplaceable for those dimensions, but only when annotation tasks are structured to produce reliable data and only when human pipelines complement automated monitoring rather than replace it.

The Irreducible Role of Human Judgment

Automated evaluation has come a long way. LLM-as-judge systems can score thousands of responses per hour, and RAGAS and similar frameworks can compute faithfulness, relevance, and answer correctness without a single human annotator. For many production monitoring tasks, these tools are more than adequate. The "human-anchored" tradition behind every such tool runs back to the Cleverdon experiments at Cranfield in the 1960s and the TREC shared evaluations that followed: human experts produce the ground-truth judgments, and automated systems are measured against those judgments.^A

But automated evaluation has a fundamental dependency: it needs to be calibrated against something. That something is human judgment. An LLM judge that scores "helpfulness" is only as good as the human-labeled examples that defined what helpfulness means in the first place. When the judge drifts, when new edge cases emerge, when the task changes, you need humans back in the loop. This dependency is structural rather than incidental: preference-based learning is the formal mechanism by which human judgments anchor every step of LLM post-training, which means the human preference signal sits underneath the entire stack.^B

This is not a limitation to be solved; it is a structural feature of evaluation itself. Language is ambiguous and quality is context-dependent, and the boundary between "acceptable" and "unacceptable" shifts with domain, audience, and use case. No metric, no matter how sophisticated, can fully encode those judgments without periodic human recalibration.

The goal is not to replace automated evaluation with human evaluation. The goal is to build hybrid systems where human judgment is deployed strategically, structured carefully, and integrated tightly with automated pipelines.

When Human Evaluation Is Necessary

Not every evaluation task requires human annotators. The decision to involve humans should be deliberate, driven by specific conditions where automated methods fall short.

High-Stakes Domains

Medical, legal, and financial applications carry consequences that demand human verification. A chatbot that gives incorrect dosage information cannot be reliably evaluated by another language model that might share the same blind spots. Hallucination, toxicity, and sycophancy are documented failure modes of LLMs; pre-LLM systems already gave medical advice that, if followed, would have caused harm.^C In these domains, human evaluation is a regulatory and ethical requirement, not a luxury. There is a growing argument that LLMs entering clinical settings should undergo something resembling clinical-trial-grade oversight rather than the ad-hoc release process that other software products follow.^D

The annotators in these cases are typically domain experts: physicians, attorneys, financial analysts. Their time is expensive, which makes task design even more critical. Every annotation hour needs to produce maximum signal.

Novel Tasks Without Established Metrics

When you build a new kind of LLM application, there may be no existing metric that captures what "good" means. Consider an LLM that generates therapy session summaries for clinicians. There is no BLEU score for clinical empathy. No ROUGE variant that measures whether the summary captures the patient's emotional state accurately. You need humans to define the quality dimensions before you can automate anything. Standardized benchmarks help when they exist, but doing well at one benchmark does not guarantee a system will adapt reliably to new tasks.^E

Calibrating Automated Judges

LLM-as-judge systems require calibration data: a set of examples where the "correct" evaluation is established by human consensus. Without this anchor, the judge's scores are just numbers. They correlate with quality only to the extent that the judge's training data happened to align with your specific definition of quality. Periodic human evaluation creates the ground truth that keeps automated judges honest. Even datasets like UltraFeedback that use one LLM to grade another trace their calibration back to human preference data: the grader was itself shaped by RLHF during its own training.³

Subjective Quality Dimensions

Tone, empathy, cultural sensitivity, humor, persuasiveness. These dimensions resist clean formalization. An LLM judge can approximate them, but the approximation needs validation. Is the model's notion of "empathetic" aligned with what your users actually experience as empathetic? Only human evaluation can answer that. Sycophantic models are particularly poor evaluators of subjective dimensions: they tend to agree with the response in front of them, which inflates scores on the very dimensions human users care about most.^F

Product Launch Baselines

When launching a new LLM-powered feature, you need baselines. What is the current quality level? Where are the failure modes? What does the distribution of quality look like across different query types? Human evaluation during the launch phase establishes these baselines. Once established, automated systems can monitor for drift from the baseline. But the baseline itself must come from humans.

The Annotation Pipeline

An annotation pipeline is more than "give responses to people and ask if they're good." It is a structured process that transforms subjective judgment into quantitative data. The pipeline has distinct stages, and cutting corners at any stage degrades the quality of every subsequent stage.

Task Decomposition

The single most common mistake in human evaluation is asking annotators to evaluate too many things at once. "Rate the quality of this response on a scale of 1 to 5" is a question that sounds simple but is actually asking the annotator to simultaneously assess factual accuracy, relevance, completeness, tone, formatting, and coherence, then collapse all of those dimensions into a single number.

Decompose the evaluation into specific, independent dimensions. Each dimension should be evaluable without reference to the others. Here is the difference between a poorly structured task and a well-structured one:

Poorly structured:

"Rate the overall quality of this response from 1 (terrible) to 5 (excellent)."

Well-structured:

"For this response, evaluate each dimension independently:

1. Factual accuracy: Does the response contain any factual errors? (Yes / No / Cannot determine)
2. Completeness: Does the response address all parts of the user's question? (Fully / Partially / Not at all)
3. Relevance: Is all information in the response relevant to the question? (All relevant / Some irrelevant / Mostly irrelevant)
4. Tone: Is the tone appropriate for the context? (Appropriate / Slightly off / Inappropriate)"

The decomposed version takes longer per annotation but produces data that is dramatically more useful. You can identify exactly which dimensions are failing. You can compute agreement per dimension. You can discover that your model is factually accurate but tonally inappropriate, a finding that "3.2 out of 5 overall quality" would completely obscure. Decomposing a composite into independent measurable dimensions is itself an old idea: the Cranfield experiments separated precision from recall because errors come in different forms, and conflating them obscures the diagnostic signal.^G
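To make the per-dimension payoff concrete, here is a minimal sketch of turning decomposed annotations into a label distribution for each dimension. The record layout (`item_id`, `dimension`, `label`) and the function name are assumptions chosen for illustration, not a format the article prescribes.

```python
from collections import Counter, defaultdict

def summarize_by_dimension(annotations):
    """Return {dimension: {label: share}} from flat annotation records.

    Each record is assumed to look like
    {"item_id": ..., "dimension": ..., "label": ...}.
    """
    counts = defaultdict(Counter)
    for ann in annotations:
        counts[ann["dimension"]][ann["label"]] += 1

    summary = {}
    for dimension, label_counts in counts.items():
        total = sum(label_counts.values())
        # Shares rather than raw counts, so dimensions are comparable.
        summary[dimension] = {
            label: n / total for label, n in label_counts.items()
        }
    return summary

# Hypothetical annotations: two responses, two dimensions.
example = [
    {"item_id": "r1", "dimension": "factual_accuracy", "label": "No"},
    {"item_id": "r2", "dimension": "factual_accuracy", "label": "No"},
    {"item_id": "r1", "dimension": "tone", "label": "Inappropriate"},
    {"item_id": "r2", "dimension": "tone", "label": "Appropriate"},
]
summary = summarize_by_dimension(example)
for dimension, shares in summary.items():
    print(dimension, shares)
```

In this toy data no response has a factual error, yet half are tonally inappropriate, which is the kind of split a single overall score would hide.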

What Annotators Are Actually Judging

Be precise about the unit of evaluation. Are annotators judging a single response in isolation? A response given a specific prompt? A response compared to a reference answer? A response compared to another model's response? Each framing produces different data and requires different guidelines.

Evaluation of a response in isolation measures absolute quality but is susceptible to annotator calibration differences. Comparative evaluation (response A vs. response B) controls for calibration but only produces relative rankings. The choice depends on what you need to learn.

Writing Annotation Guidelines

The annotation guidelines document is the single most important artifact in any human evaluation effort. It is more important than the annotation platform, more important than the annotator selection criteria, more important than the statistical analysis plan. If the rubric is vague, nothing downstream can save the data. Treating annotation as a methodology with explicit guidelines, structured training, and reliability measurement is the established discipline that this article borrows from.⁵

Anatomy of Good Guidelines

Effective guidelines contain four components:

1. Definition of each label or score. Not just a word. A paragraph. What does "partially complete" mean, precisely? Where is the boundary between "partially complete" and "not at all"?
2. Positive and negative examples for each label. At least two of each. The examples should be realistic, drawn from actual model outputs, not fabricated to be obviously correct or obviously wrong.
3. Boundary cases with worked reasoning. These are the cases that fall near the decision boundary. "This response is missing one sub-question out of four. Is that 'fully complete' or 'partially complete'? Per our guidelines, missing any sub-question counts as 'partially complete.' Here is why..."
4. A decision tree for ambiguous cases. When an annotator is unsure, they need a procedure, not just a definition. "If you cannot determine factual accuracy because you lack domain knowledge, mark 'Cannot determine.' Do not guess."

A Concrete Example

Suppose you are evaluating an LLM-powered customer support assistant. One evaluation dimension is resolution completeness: did the response fully address the customer's issue? Your guidelines might look like this:

## Dimension: Resolution Completeness
##
## Definition:
## A response is "Fully Resolved" if it provides all information
## or actions needed for the customer to solve their problem
## without further interaction.
##
## Labels:
##   FULLY_RESOLVED    - Customer can act on this response alone
##   PARTIALLY_RESOLVED - Some useful info, but follow-up needed
##   NOT_RESOLVED       - Response does not address the issue
##   NOT_APPLICABLE     - Query is not a support request
##
## ---- Example 1: FULLY_RESOLVED ----
## Customer: "How do I reset my password?"
## Response: "Go to Settings > Account > Reset Password.
##   You'll receive an email with a reset link within
##   2 minutes. Check your spam folder if you don't see it."
## Reasoning: Step-by-step instructions, expected timeline,
##   and a common troubleshooting tip. No follow-up needed.
##
## ---- Example 2: PARTIALLY_RESOLVED ----
## Customer: "How do I reset my password?"
## Response: "You can reset your password in Settings."
## Reasoning: Points the customer in the right direction
##   but omits the specific navigation path. Customer will
##   likely need to follow up or search within Settings.
##
## ---- Boundary Case ----
## Customer: "My order hasn't arrived and I need it by Friday."
## Response: "I've checked your order #4521. It shipped on
##   Monday via standard shipping (5-7 business days). It
##   should arrive by Thursday. If it doesn't arrive by
##   Friday, contact us for a full refund."
## Label: FULLY_RESOLVED
## Reasoning: Even though the customer might still not
##   receive the package, the response provides shipping
##   status, an expected date, and a contingency plan.
##   The customer has everything they need.

Notice the level of specificity. The boundary case is doing the heavy lifting. It teaches annotators how to reason about the label, not just how to apply it. Guidelines without boundary cases are guidelines that will produce inconsistent data. The same kind of guideline, when written for crowdworkers, can be directly repurposed as an instruction-tuning prompt, which is a useful reminder that investing in guideline quality pays off in two pipelines, not just one.^H

Likert Scales vs. Binary vs. Ranking

The choice of response format shapes the data you collect, the agreement you observe, and the analyses you can perform. There is no universally correct format. Each has strengths, weaknesses, and contexts where it excels.

Likert Scales (1-5 or 1-7)

Likert scales are the default choice for many evaluation tasks, and often the wrong one. The core problem is calibration: one annotator's "3" is another annotator's "4." This is not a training failure. It reflects genuine differences in how people use ordinal scales. Some annotators cluster around the middle; others use the full range. Some avoid extreme scores; others do not.

Likert scales work best when the dimension being evaluated has a clear, observable gradient. "How many factual errors does this response contain?" maps naturally to a scale. "How helpful is this response?" does not, because "helpful" means different things to different people. The production preference-data standard reflects this: responses are rated on a Likert scale along independent dimensions (helpfulness, honesty, correctness, complexity, verbosity) rather than as a single composite.^I

If you use Likert scales, anchor each point with a concrete description. Not "1 = Poor, 5 = Excellent" but "1 = Contains multiple factual errors that would mislead the user, 5 = All claims are accurate and verifiable."

Binary Labels (Yes/No)

Binary labels are typically the most reliable response format. Agreement tends to be higher because the decision space is smaller. Analysis is simpler because you are working with proportions rather than ordinal distributions. The tradeoff is information loss: you cannot distinguish between "slightly off" and "catastrophically wrong."

Binary labels are ideal for dimensions with clear thresholds. "Does this response contain a hallucinated fact?" is a natural binary question. "Is this response well-written?" is not. The boundary around "hallucinated fact" itself is contested in the literature: a model generating a date that appears in a database record but happens to be wrong is performing correctly as a language model and incorrectly as a knowledge source, which is a distinction the guidelines need to encode rather than gloss over.^J

A useful pattern: use binary labels for multiple specific dimensions rather than a single Likert scale for overall quality. Five binary questions usually produce more reliable, more actionable data than one five-point scale.

Ranking (Pairwise or Full)

Ranking asks annotators to compare responses rather than rate them absolutely. "Which response is better, A or B?" sidesteps the calibration problem entirely. Annotators do not need to agree on what "4 out of 5" means. They only need to agree on which response is better. The mathematics underneath this idea is the Bradley-Terry model: ordinal "which is better" judgments can be converted into the cardinal scores needed for reward-model training, which is why pairwise data is the canonical training signal for RLHF.⁷
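As a rough illustration of how "which is better" votes become cardinal scores, the sketch below fits Bradley-Terry strengths with the classic iterative (minorization-maximization) update. The function name, the win-count matrix, and the stopping rule are assumptions made for this example; production reward-model training fits the same model with gradient methods on much larger data.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths that sum to 1. Assumes every item has at least
    one win and one loss; otherwise the estimate degenerates.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        updated = [x / norm for x in updated]
        converged = max(abs(a - b) for a, b in zip(p, updated)) < tol
        p = updated
        if converged:
            break
    return p

# Hypothetical counts: each pair of three models compared 10 times.
wins = [
    [0, 7, 8],  # model A beat B 7 times and C 8 times
    [3, 0, 6],  # model B beat A 3 times and C 6 times
    [2, 4, 0],  # model C beat A 2 times and B 4 times
]
strengths = fit_bradley_terry(wins)
# Under the model, P(A preferred over B) = s_A / (s_A + s_B).
p_a_over_b = strengths[0] / (strengths[0] + strengths[1])
print([round(s, 3) for s in strengths], round(p_a_over_b, 3))
```

With only two items the fitted strengths reduce to the observed win rates, which is a quick sanity check on any implementation.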

Pairwise ranking is especially useful for A/B testing between models or between prompt variants. It directly answers the question "Did this change make things better?" without requiring agreement on how much better.

The limitation: ranking produces only ordinal data. You know A is better than B, but not by how much. And ranking does not work when you need absolute quality thresholds, for example, "Is this response safe enough to show to users?"

Three formats, three trade-offs.

In practice, hybrid formats work well. Use binary labels for critical dimensions (safety, factual accuracy) and ranking for subjective quality comparisons. Reserve Likert scales for dimensions where the extra granularity justifies the lower agreement.

Inter-Annotator Agreement

If two annotators look at the same response and produce different labels, something is wrong. But what? The disagreement could be in the annotators, in the guidelines, or in the task itself. Inter-annotator agreement (IAA) metrics help diagnose which.

The Metrics

Cohen's kappa measures agreement between exactly two annotators, correcting for chance agreement. A kappa of 0 means agreement is no better than random; a kappa of 1 means perfect agreement. Values above 0.6 are generally considered "substantial" agreement; above 0.8 is "almost perfect."
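For reference, the chance correction can be written out explicitly. The worked numbers below are a made-up example (50 items, two annotators, a yes/no label), not data from this article.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{A,k}\, p_{B,k}

% p_o: observed share of items on which the two annotators agree
% p_{A,k}, p_{B,k}: share of items each annotator assigned to label k

% Example: agreement on 40 of 50 items, so p_o = 0.80.
% Annotator A says "yes" on 20 items (0.40), annotator B on 30 (0.60):
p_e = (0.40)(0.60) + (0.60)(0.40) = 0.48,
\qquad
\kappa = \frac{0.80 - 0.48}{1 - 0.48} \approx 0.615
```

The same 80% raw agreement can therefore map to quite different kappa values depending on how the labels are distributed, which is why raw agreement alone is not reported.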

Fleiss' kappa plays the same role for three or more annotators (strictly, it generalizes Scott's pi rather than Cohen's kappa). It is the standard metric when you have multiple annotators rating the same set of items. The interpretation scale is the same.

Krippendorff's alpha is the most flexible metric. It handles any number of annotators, any measurement scale (nominal, ordinal, interval, ratio), and missing data. If you are going to compute one IAA metric, make it Krippendorff's alpha.¹ The metric you pick is not a casual choice: each one makes different assumptions about chance agreement and label-distribution balance, and the same data can produce noticeably different numbers depending on which metric is reported.⁶

import numpy as np
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Two annotators rating 10 responses on a binary dimension
# 1 = factual error present, 0 = no factual error
annotator_1 = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]
annotator_2 = [0, 1, 0, 1, 1, 1, 0, 0, 0, 0]

# Cohen's kappa for two annotators
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
# Output: Cohen's kappa: 0.583

# Krippendorff's alpha (handles any number of annotators)
# Each row is one annotator, each column is one item
# Use np.nan for missing annotations
reliability_data = np.array([
    [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],  # annotator 1
    [0, 1, 0, 1, 1, 1, 0, 0, 0, 0],  # annotator 2
    [0, 1, 1, 0, 1, 1, 0, 0, 1, 0],  # annotator 3
])

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    level_of_measurement="nominal"
)
print(f"Krippendorff's alpha: {alpha:.3f}")
# Output: Krippendorff's alpha: 0.606

Interpreting Agreement Scores

Low agreement is not annotator failure. It is diagnostic information. When kappa drops below 0.4, the first response should be to examine the guidelines, not to retrain the annotators.

Common causes of low agreement:

Vague label definitions. If "partially complete" is not precisely defined, two reasonable annotators will draw the boundary in different places.
Missing boundary cases. The guidelines may define the prototypical examples well but leave the edge cases undefined.
Too many dimensions evaluated simultaneously. Cognitive overload produces noise. Annotators start applying heuristics rather than following the guidelines.
Genuinely ambiguous items. Some responses are legitimately on the boundary. High disagreement on specific items (not overall) is expected and informative. Flag these for adjudication rather than forcing a consensus.

A practical heuristic: if you run a calibration round and kappa is below 0.6, revise the guidelines before proceeding. The cost of revising guidelines is small. The cost of collecting unreliable data is large.
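Item-level disagreement can be surfaced with very little code. The sketch below flags items whose most common label falls short of a chosen share; the 0.6 threshold, the function name, and the input layout are assumptions for illustration.

```python
from collections import Counter

def flag_for_adjudication(labels_by_item, min_agreement=0.6):
    """Return item_ids whose top label share is below min_agreement.

    labels_by_item maps item_id -> list of labels, one per annotator.
    Items with no labels are skipped.
    """
    flagged = []
    for item_id, labels in labels_by_item.items():
        if not labels:
            continue
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count / len(labels) < min_agreement:
            flagged.append(item_id)
    return flagged

# Hypothetical labels from three annotators on three responses.
labels_by_item = {
    "r1": ["Fully", "Fully", "Fully"],
    "r2": ["Fully", "Partially", "Not at all"],
    "r3": ["Partially", "Partially", "Fully"],
}
flagged = flag_for_adjudication(labels_by_item)
print(flagged)
```

Here only the three-way split on `r2` is flagged; a two-to-one split passes at this threshold. If many flagged items share a pattern, that pattern usually points at a missing boundary case in the guidelines.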

Annotator Selection and Training

Who does the annotation matters as much as how the annotation task is designed. The two main pools are domain experts and crowd workers, and the choice between them involves tradeoffs in cost, reliability, and validity. The annotation phase is the inverse of the preference-tuning phase that produces aligned models: the same human judgments that train better models also serve as the gold standard for evaluating them, which means annotation quality propagates through the entire system.^K

Domain Experts vs. Crowd Workers

Domain experts (physicians evaluating medical responses, lawyers evaluating legal advice) produce annotations with higher validity. Their judgments reflect actual domain knowledge. The cost is high, often $50 to $100+ per hour, and availability is limited.

Crowd workers (via platforms like Amazon Mechanical Turk, Prolific, or Surge AI) are cheaper and more scalable. For many dimensions, especially those that do not require specialized knowledge (fluency, coherence, tone), crowd workers perform well. The empirical license for this is well-established: aggregated crowdworker annotations can match expert quality on many NLP tasks when quality control is in place.² The key is proper training and quality control.

A middle path: use domain experts to create the guidelines and the gold standard examples, then use trained crowd workers for the bulk of the annotation. The experts define what "correct" looks like; the crowd workers scale the application of that definition. The expert layer matters more than it might appear, because the expert judgments propagate downstream: every model trained on outputs that traced back to expert-shaped preferences inherits the quality of those judgments.^L

The Calibration Phase

Never send annotators directly into production annotation. Start with a calibration phase where all annotators label the same set of 20 to 30 examples; compute IAA, discuss the disagreements, revise the guidelines based on what the disagreements reveal, and repeat until agreement reaches acceptable levels.

This phase typically takes two or three rounds. It feels slow but is the highest-leverage activity in the entire evaluation process. Skipping calibration to save time is a false economy that produces data you cannot trust.

Gold Standard Examples and Ongoing Quality Control

After calibration, quality does not maintain itself automatically. Annotators drift over time, fatigue accumulates, and shortcuts emerge. Ongoing quality monitoring is part of the operating cost of any production annotation pipeline.

The standard approach is to embed gold standard items (also called "trap questions" or "honeypots") into the annotation stream. These are items where the correct label is known. If an annotator's accuracy on gold items drops below a threshold (typically 80-85%), their recent annotations are flagged for review.

def monitor_annotator_quality(
    annotations: list[dict],
    gold_items: dict[str, str],
    accuracy_threshold: float = 0.85
) -> dict:
    """
    Check annotator accuracy against gold standard items.

    Args:
        annotations: List of {item_id, annotator_id, label}
        gold_items: Dict mapping item_id to known-correct label
        accuracy_threshold: Minimum acceptable accuracy

    Returns:
        Dict of annotator_id -> {accuracy, flagged, n_gold}
    """
    annotator_results = {}

    for ann in annotations:
        if ann["item_id"] not in gold_items:
            continue

        aid = ann["annotator_id"]
        if aid not in annotator_results:
            annotator_results[aid] = {"correct": 0, "total": 0}

        expected = gold_items[ann["item_id"]]
        if ann["label"] == expected:
            annotator_results[aid]["correct"] += 1
        annotator_results[aid]["total"] += 1

    return {
        aid: {
            "accuracy": r["correct"] / r["total"] if r["total"] > 0 else 0,
            "flagged": (r["correct"] / r["total"]) < accuracy_threshold
                        if r["total"] > 0 else True,
            "n_gold": r["total"],
        }
        for aid, r in annotator_results.items()
    }

The ratio of gold items to regular items matters. Too few and you cannot detect drift reliably. Too many and you are paying annotators to label items you already know the answer to. A ratio of 10-15% gold items is a reasonable starting point.
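One way to hit a target gold ratio is to compute how many gold items a batch needs and shuffle them in. This is a sketch under simple assumptions (items are opaque values, gold items are drawn without replacement, names are hypothetical), not a description of how any particular annotation platform does it.

```python
import random

def build_annotation_stream(regular_items, gold_items, gold_ratio=0.10, seed=0):
    """Interleave gold items so they form about gold_ratio of the stream."""
    if not 0 <= gold_ratio < 1:
        raise ValueError("gold_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_gold / (n_regular + n_gold) = gold_ratio for n_gold.
    n_gold = round(len(regular_items) * gold_ratio / (1 - gold_ratio))
    n_gold = min(n_gold, len(gold_items))  # cannot use more gold than exists
    stream = list(regular_items) + rng.sample(list(gold_items), n_gold)
    rng.shuffle(stream)  # annotators should not be able to spot gold by position
    return stream

# Hypothetical batch: 90 regular items and a pool of 20 gold items.
regular = [f"item-{i}" for i in range(90)]
gold = [f"gold-{i}" for i in range(20)]
stream = build_annotation_stream(regular, gold, gold_ratio=0.10)
n_gold_in_stream = sum(1 for x in stream if x.startswith("gold-"))
print(len(stream), n_gold_in_stream)
```

Rotating which gold items are drawn per batch, as the random sample does here, also makes it harder for annotators to memorize them.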

Sample Size and Statistical Power

How many examples do you need to annotate? The answer depends on what you are trying to learn and how confident you need to be.

Initial Calibration

For establishing baselines and calibrating automated judges, 100 to 200 annotated examples is a practical minimum. This gives you enough data to estimate quality distributions per dimension, identify common failure modes, and compute preliminary IAA metrics. It is not enough for fine-grained comparisons between conditions. Note that the calibration set must be stratified carefully: in instruction-tuning evaluation, leave-one-out works at the cluster level rather than the task level, because overlapping tasks within a cluster inflate the estimates.^M

Comparative Evaluation (A/B Testing)

When comparing two systems (e.g., model A vs. model B, prompt v1 vs. prompt v2), you need enough samples to detect a meaningful difference. The required sample size depends on the expected effect size and your tolerance for false positives and false negatives.

For pairwise comparison with binary labels ("Which response is better, A or B?"), a rough guideline:

Smaller effects need disproportionately more annotated pairs: halving the effect size roughly quadruples the required sample.

A 60/40 split (one model is clearly better on 60% of examples) is a large effect. You can detect it with 100 annotated pairs. A 55/45 split requires 400. A 52/48 split requires 2,500, which is where the cost problem becomes acute.
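These figures can be sanity-checked with a normal approximation to a two-sided test of the win rate against 50%. The sketch below is an approximation under stated assumptions (independent pairs, no ties, a 5% significance level); the function name and the `power` parameter are illustrative. The round numbers quoted above appear consistent with the sample size at which such a test only just reaches significance (about 50% power), so a study planned for the more conventional 80% power needs roughly twice as many pairs.

```python
import math
from statistics import NormalDist

def pairs_needed(win_rate, alpha=0.05, power=0.8):
    """Approximate number of annotated pairs needed to detect win_rate != 0.5.

    Normal approximation for a one-sample, two-sided test of a proportion.
    """
    if win_rate == 0.5:
        raise ValueError("no effect to detect at win_rate = 0.5")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    effect = abs(win_rate - 0.5)
    sd_null = 0.5  # standard deviation of one vote under the null
    sd_alt = math.sqrt(win_rate * (1 - win_rate))
    n = ((z_alpha * sd_null + z_beta * sd_alt) / effect) ** 2
    return math.ceil(n)

for rate in (0.60, 0.55, 0.52):
    print(
        f"{rate:.0%} win rate:",
        pairs_needed(rate, power=0.5), "pairs at 50% power,",
        pairs_needed(rate, power=0.8), "pairs at 80% power",
    )
# At 50% power this gives 97, 385, and 2401 pairs respectively.
```

The required sample scales with the inverse square of the gap from 50%, so the growth is quadratic rather than exponential.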

In practice, if you cannot detect a difference with 100-200 annotated examples, the difference may not be large enough to matter to users. This is a useful heuristic for deciding when to stop annotating.

Multiple Dimensions

If you are evaluating five dimensions independently, you need the above sample sizes per dimension. You can evaluate all dimensions on the same set of examples, but your statistical comparisons need to account for the multiple testing. Bonferroni correction or false discovery rate control keeps your error rates honest.
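Both corrections fit in a few lines. The sketch below implements Bonferroni and the Benjamini-Hochberg procedure (one common way to control the false discovery rate) on a hypothetical set of five per-dimension p-values; the function names and the p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control.

    Finds the largest rank k with p_(k) <= (k / m) * alpha and rejects
    the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for five dimensions in one A/B comparison.
dimensions = ["accuracy", "completeness", "relevance", "tone", "safety"]
p_values = [0.004, 0.030, 0.020, 0.300, 0.012]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
for name, p, b, f in zip(dimensions, p_values, bonf, bh):
    print(f"{name:13s} p={p:.3f} bonferroni={b} bh={f}")
```

On these numbers Bonferroni keeps only the strongest result while Benjamini-Hochberg keeps four of five, which illustrates the trade-off: Bonferroni is stricter about any false positive, FDR control tolerates a small expected share of them in exchange for power.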

The Cost Problem

Human evaluation is expensive. A single annotation task with 200 examples, three annotators per example, and crowd worker rates of $15 to $20/hour can easily cost $500 to $1,000. With domain experts at $75 to $100/hour, the same task costs $3,000 to $5,000. And that is one evaluation round for one dimension. The practitioner consensus is that no single evaluation method is sufficient at this cost structure; benchmarks, LLM-as-judge, and human review have to be combined.^N
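The arithmetic behind estimates like these is simple enough to script for your own numbers. The sketch below assumes about four minutes per annotation, which is an illustrative assumption rather than a figure from this article; actual time per item varies widely with task complexity.

```python
def annotation_cost(n_examples, annotators_per_example,
                    minutes_per_annotation, hourly_rate):
    """Estimated cost of one annotation round for one dimension."""
    hours = n_examples * annotators_per_example * minutes_per_annotation / 60
    return hours * hourly_rate

# 200 examples, 3 annotators each, assumed 4 minutes per annotation.
for label, rate in [("crowd, $15/h", 15), ("crowd, $20/h", 20),
                    ("expert, $75/h", 75), ("expert, $100/h", 100)]:
    cost = annotation_cost(200, 3, 4, rate)
    print(f"{label:15s} ${cost:,.0f}")
```

Under that assumption the round takes 40 annotator-hours, which lands inside the ranges quoted above; doubling the time per item doubles every figure.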

Multiply by multiple dimensions, multiple evaluation rounds (initial calibration, periodic audits, post-deployment monitoring), and multiple product iterations, and human evaluation becomes one of the largest line items in an LLM team's budget.

Cost is not a reason to skip human evaluation. It is a reason to be strategic about when and how you deploy it.

Tiered Evaluation

Run automated evaluation on everything. Run human evaluation on a strategic sample. The automated layer catches the obvious failures: responses that are off-topic, empty, or contain known error patterns. The human layer handles what automation cannot: subtle quality differences, novel failure modes, subjective dimensions.

A practical tiered approach:

Tier 1 (automated, 100% of traffic): LLM-as-judge scoring on core dimensions. Flag responses below a threshold.
Tier 2 (automated + sampling, 5-10% of traffic): More detailed automated evaluation on a random sample. Stratify by query type, model confidence, or user segment.
Tier 3 (human, 1-2% of traffic): Full human evaluation on a carefully selected subset. Include edge cases, flagged items from Tier 1, and a random sample for calibration.

Funnel narrows as evaluation cost rises.
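A routing function for the three tiers might look like the sketch below. The field name `auto_score`, the flag threshold, and the sampling rates are assumptions for illustration; the simulated scores are drawn from a skewed distribution only so that most traffic looks healthy.

```python
import random

def assign_tiers(item, rng, flag_threshold=0.3,
                 tier2_rate=0.10, tier3_rate=0.02):
    """Return the set of evaluation tiers one response passes through.

    item is a dict with an "auto_score" in [0, 1] from the Tier 1 judge.
    """
    tiers = {1}  # Tier 1 runs on 100% of traffic
    if rng.random() < tier2_rate:
        tiers.add(2)  # random sample for more detailed automated checks
    flagged = item["auto_score"] < flag_threshold
    if flagged or rng.random() < tier3_rate:
        tiers.add(3)  # human review: flagged items plus a random sample
    return tiers

rng = random.Random(42)
# Simulated traffic with mostly high judge scores (hypothetical data).
traffic = [{"item_id": i, "auto_score": rng.betavariate(5, 1)}
           for i in range(1000)]
routed = [assign_tiers(item, rng) for item in traffic]
human_share = sum(3 in tiers for tiers in routed) / len(routed)
print(f"Share sent to human review: {human_share:.1%}")
```

If the flag rate climbs, flagged items alone can exceed the human budget, so a real pipeline would also need to cap or subsample the flagged set.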

The same tiered pattern shows up in real deployments. In a clinical trial of LLM-based therapy, investigators scrupulously reviewed every interaction while accepting that this level of oversight would be difficult to scale outside the trial; they explicitly suggest that the documented intervention cases could inform automated guardrails and semi-automated pipelines that flag concerning posts for review, which is the Tier 1 / Tier 2 / Tier 3 architecture under another name.^O

Active Learning for Annotation Selection

Not all examples are equally informative. An example where the automated judge is highly confident (score of 0.99) tells you less than an example where the judge is uncertain (score of 0.52). Active learning selects the most informative examples for human annotation, maximizing the signal per annotation dollar.

import numpy as np

def select_for_human_review(
    scored_items: list[dict],
    budget: int,
    uncertainty_weight: float = 0.7,
    random_weight: float = 0.3
) -> list[dict]:
    """
    Select items for human annotation using a hybrid strategy:
    - Most of the budget goes to uncertain items (near decision boundary)
    - Some budget goes to random items (for unbiased calibration)

    Args:
        scored_items: List of {item_id, auto_score, ...}
        budget: Number of items to select
        uncertainty_weight: Fraction of budget for uncertain items
        random_weight: Fraction of budget for random items
    """
    n_uncertain = int(budget * uncertainty_weight)
    n_random = budget - n_uncertain

    # Uncertainty: items closest to the decision boundary (0.5)
    by_uncertainty = sorted(
        scored_items,
        key=lambda x: abs(x["auto_score"] - 0.5)
    )
    uncertain_items = by_uncertainty[:n_uncertain]

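    # Random: unbiased sample from the rest.
    # Match on item_id rather than dict equality, and sample indices
    # so NumPy never has to coerce the dicts into an array.
    chosen_ids = {x["item_id"] for x in uncertain_items}
    remaining = [x for x in scored_items if x["item_id"] not in chosen_ids]
    picked = np.random.permutation(len(remaining))[:n_random]
    random_items = [remaining[i] for i in picked]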

    return uncertain_items + random_items

The random component is essential. If you only annotate uncertain items, you lose the ability to estimate overall system quality. The uncertain items tell you where the automated judge is struggling. The random items tell you whether the judge is calibrated correctly in general.
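The random slice is what supports an overall quality estimate, and with small samples that estimate should carry an interval. Below is a minimal sketch using the Wilson score interval for a proportion; the counts in the example are hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return center - half_width, center + half_width

# Hypothetical audit: 52 of 60 randomly sampled responses judged acceptable.
low, high = wilson_interval(52, 60)
print(f"Estimated acceptable rate: {52 / 60:.1%} "
      f"(95% interval {low:.1%} to {high:.1%})")
```

Only the randomly sampled items belong in this calculation; mixing in the uncertainty-selected items would bias the estimate toward the judge's hardest cases.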

Periodic Audits Rather Than Continuous Evaluation

Continuous human evaluation is prohibitively expensive for most teams. A more sustainable model conducts thorough human evaluation at specific checkpoints (after a model update, after a significant prompt change, and at least quarterly), and lets the automated systems monitor between audits. The audits validate that the automated monitors are still trustworthy.

Combining Human and Automated Evaluation

The most effective evaluation systems are hybrid. They use automated methods for breadth and speed, and human methods for depth and calibration. The two approaches are not competitors. They are complements, each compensating for the other's weaknesses. The methodology for maintaining human-automated alignment over time (holding out human-graded examples, measuring judge-human agreement, and iterating on the rubric) is the subject of recent work on validating automated evaluators.⁴

The Hybrid Workflow

A mature evaluation pipeline looks like this:

1. Define quality dimensions with input from domain experts and stakeholders. This is inherently a human activity.
2. Create annotation guidelines for each dimension. Test them with a calibration round.
3. Collect human annotations on an initial dataset (100-200 examples). This becomes your ground truth.
4. Train or calibrate automated judges against the human-annotated ground truth. Measure correlation between automated scores and human labels.
5. Deploy automated judges for continuous monitoring. They run on every response (or a large sample).
6. Periodically audit the automated judges with fresh human annotations. If the correlation between human and automated scores has dropped, recalibrate.
7. Use human evaluation for novel failure modes that the automated judge was not designed to detect. Each new failure mode discovered by humans becomes a new dimension for the automated system.

The feedback loop between steps 6 and 7 is what makes the system improve over time. Human evaluation discovers what the automated system misses, the automated system is updated to catch those failures, and humans then look for the next set of gaps. The cycle continues indefinitely.
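The audit step can be reduced to a small check: recompute judge-human agreement on fresh labels and compare it with the agreement recorded at calibration time. The sketch below computes Cohen's kappa from scratch to stay dependency-free; the function names, the 0.60 floor, and the 0.10 allowed drop are assumptions for illustration.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label shares.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label
    return (p_o - p_e) / (1 - p_e)

def audit_judge(human_labels, judge_labels, baseline_kappa,
                floor=0.60, max_drop=0.10):
    """Decide whether an automated judge needs recalibration."""
    kappa = cohen_kappa(human_labels, judge_labels)
    too_low = kappa < floor
    drifted = (baseline_kappa - kappa) > max_drop
    return {"kappa": kappa, "recalibrate": too_low or drifted}

# Hypothetical audit batch of eight freshly annotated responses.
human = [1, 1, 1, 0, 0, 0, 1, 0]
judge = [1, 1, 0, 0, 0, 0, 1, 1]
result = audit_judge(human, judge, baseline_kappa=0.72)
print(result)
```

In practice an audit batch would be far larger than eight items; the toy size only keeps the arithmetic easy to follow by hand.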

Measuring Judge-Human Correlation

When you deploy an automated judge, you need to know how well it agrees with human annotators. This is itself an IAA problem: treat the judge as another annotator and compute agreement metrics.

from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

def evaluate_judge_alignment(
    human_labels: list,
    judge_labels: list,
    label_type: str = "nominal"
) -> dict:
    """
    Measure how well an automated judge aligns with
    human annotations.

    Args:
        human_labels: Aggregated human labels (majority vote
                      or adjudicated)
        judge_labels: Automated judge predictions
        label_type: "nominal" for categories, "ordinal" for
                    Likert scales
    """
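    if len(human_labels) != len(judge_labels):
        raise ValueError(
            "human_labels and judge_labels must be the same length"
        )
    if not human_labels:
        raise ValueError("label lists must not be empty")

    results = {}

    if label_type == "nominal":
        # Chance-corrected agreement plus the raw agreement rate
        results["cohens_kappa"] = cohen_kappa_score(
            human_labels, judge_labels
        )
        results["accuracy"] = sum(
            h == j for h, j in zip(human_labels, judge_labels)
        ) / len(human_labels)

    elif label_type == "ordinal":
        # Rank correlations respect the ordering of Likert points
        rho, p_spearman = spearmanr(human_labels, judge_labels)
        tau, p_kendall = kendalltau(human_labels, judge_labels)
        results["spearman_rho"] = rho
        results["kendall_tau"] = tau
        results["p_value"] = p_spearman  # Spearman p-value
        results["kendall_p_value"] = p_kendall

    else:
        raise ValueError(f"Unknown label_type: {label_type!r}")

    return results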

# Example: Evaluating a judge on factual accuracy (binary)
human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
judge  = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

alignment = evaluate_judge_alignment(human, judge, "nominal")
print(f"Judge-Human kappa: {alignment['cohens_kappa']:.3f}")
print(f"Judge accuracy:    {alignment['accuracy']:.1%}")
# Judge-Human kappa: 0.583
# Judge accuracy:    80.0%

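As a rule of thumb, a judge-human kappa below 0.6 is a red flag: the judge is disagreeing with human annotators more often than a dependable monitor should, even after correcting for chance agreement. Either the judge needs recalibration (a revised prompt, a revised rubric, or retraining), or the human annotations need to be reviewed for consistency. On a sample as small as the ten items above, kappa is also a noisy estimate, so confirm the result on a larger audit set before acting on it.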
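Small audit sets make any kappa threshold hard to apply, because the estimate itself is uncertain. One way to see that uncertainty is a percentile bootstrap interval around kappa. The sketch below is illustrative only: `bootstrap_kappa_interval`, the resample count, the seed, and the 20-item sample are assumptions, and the pure-Python `cohen_kappa` stands in for `cohen_kappa_score`.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa; returns NaN when chance agreement is 1."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined: only one label in use
    return (observed - expected) / (1 - expected)


def bootstrap_kappa_interval(human, judge, n_resamples=2000,
                             alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling items."""
    rng = random.Random(seed)
    n = len(human)
    estimates = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping pairs intact
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([human[i] for i in idx],
                        [judge[i] for i in idx])
        if k == k:  # drop NaN from degenerate resamples
            estimates.append(k)
    estimates.sort()
    low = estimates[int((alpha / 2) * len(estimates))]
    high = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return low, high


# Hypothetical 20-item audit: 16 of 20 labels agree
human = [1] * 10 + [0] * 10
judge = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2

point = cohen_kappa(human, judge)
low, high = bootstrap_kappa_interval(human, judge)
print(f"kappa = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

If the interval straddles the threshold you care about, the audit set is too small to tell whether the judge is above or below it.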

Tools and Platforms

You do not need to build annotation infrastructure from scratch. Several platforms provide the scaffolding for structured annotation tasks, and choosing the right one saves significant engineering time.

Label Studio is open-source and highly configurable. It supports custom annotation interfaces, handles task routing, and provides built-in IAA metrics. It is a strong default choice for teams that want to self-host.

Prodigy, from the makers of spaCy, emphasizes active learning workflows. It integrates well with NLP pipelines and is designed for rapid annotation with minimal friction. It is particularly good for binary annotation tasks.

Argilla is designed specifically for LLM evaluation and RLHF data collection. It supports pairwise comparison, Likert scales, and free-text feedback out of the box. Its focus on LLM workflows makes it a natural fit for the use cases discussed in this article. There are three production sources of preference data that a tool like Argilla has to support: direct human annotation, implicit web judgments (vote-based rankings on platforms like StackExchange), and fully synthetic collection that uses an LLM as the annotator. Production teams often draw from all three sources rather than rely on a dedicated annotation campaign for every signal.^P

The platform matters less than the process. Whatever tool you use, it should support three things:

Provenance tracking. Who annotated what, when, and using which version of the guidelines.
IAA computation. Built-in or easy-to-export data for computing agreement metrics.
Task routing. The ability to assign specific items to specific annotators, embed gold items, and balance workload.

If your current tool does not support these three capabilities, you will end up rebuilding them in spreadsheets and scripts. That path leads to errors that are hard to detect and harder to fix.
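To make the provenance requirement concrete, here is a minimal sketch of the record an annotation export might carry. The `AnnotationRecord` schema and every field name in it are hypothetical and do not correspond to the export format of any platform named above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AnnotationRecord:
    """Hypothetical schema: one label, with who/when/which guidelines."""
    item_id: str
    annotator_id: str
    dimension: str
    label: str
    guideline_version: str
    annotated_at: str       # ISO 8601 timestamp, UTC
    is_gold: bool = False   # True for embedded gold items


def make_record(item_id: str, annotator_id: str, dimension: str,
                label: str, guideline_version: str,
                is_gold: bool = False,
                now: Optional[datetime] = None) -> AnnotationRecord:
    """Stamp the record with the current UTC time unless one is given."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return AnnotationRecord(item_id, annotator_id, dimension, label,
                            guideline_version, timestamp, is_gold)


record = make_record("resp-0042", "ann-07", "factual_accuracy",
                     "correct", "v1.3")
print(json.dumps(asdict(record), indent=2))
```

Storing the guideline version on every record is what lets you later separate disagreements caused by annotators from disagreements caused by a guideline revision.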

Common Mistakes

Human evaluation can fail in ways that are subtle and difficult to detect after the fact. The annotations look complete. The numbers look reasonable. But the data is unreliable, and decisions made on that data will be wrong. Some of these failures are inherited from the limits of automated metrics: precision, recall, and F1 can measure surface properties but not the open-ended dimensions that matter most, so common annotation mistakes often start with the impulse to make human evaluation as mechanical as the automated kind.^Q The failure modes below are the ones to watch for.

Vague Guidelines

If annotators can interpret a label definition two different ways, they will. "Somewhat relevant" is not a definition. "Contains at least one piece of information that directly addresses the user's question, but also contains information that does not relate to the question" is a definition. The difference in effort is small. The difference in data quality is enormous.

No Calibration Phase

Sending annotators straight into production is the evaluation equivalent of deploying untested code. The calibration phase catches guideline ambiguities, identifies annotators who misunderstand the task, and establishes baseline agreement levels. Skipping it saves days but wastes weeks of downstream effort when the data turns out to be unusable.

Too Many Dimensions Per Task

Annotator attention is a finite resource. Asking someone to evaluate six dimensions for each of 100 responses produces reliable annotations for the first two dimensions and increasingly noisy annotations for the rest. If you have six dimensions, split them into two or three separate annotation tasks. The total annotation time increases, but the data quality per dimension increases more.

Ignoring Annotator Fatigue

Annotation quality degrades after 60-90 minutes of continuous work. Long sessions without breaks produce data where the first 50 annotations are careful and the last 50 are pattern-matched. Build breaks into the workflow. Limit sessions to 45-60 minutes. Monitor per-annotator agreement over time within a session to detect fatigue effects.
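One way to monitor within-session quality is to track accuracy on embedded gold items by position in the session. This sketch uses gold-item accuracy as a proxy for agreement, which is an assumption. The bucket size, the `max_drop` threshold, and the sample session are likewise illustrative.

```python
def accuracy_by_bucket(gold_results, bucket_size=25):
    """Gold-item accuracy per position bucket within one session.

    gold_results: iterable of (position_in_session, is_correct).
    """
    buckets = {}
    for position, is_correct in gold_results:
        bucket = position // bucket_size
        hits, total = buckets.get(bucket, (0, 0))
        buckets[bucket] = (hits + int(is_correct), total + 1)
    return {b: hits / total
            for b, (hits, total) in sorted(buckets.items())}


def fatigue_suspected(bucket_accuracy, max_drop=0.15):
    """Flag a session whose last bucket trails its first bucket.

    max_drop is an assumed threshold, not an established standard.
    """
    if len(bucket_accuracy) < 2:
        return False
    values = list(bucket_accuracy.values())
    return values[0] - values[-1] > max_drop


# Hypothetical session of 100 items with 12 embedded gold items
session = [
    (3, True), (11, True), (19, True),
    (28, True), (36, True), (44, True),
    (52, True), (61, False), (70, True),
    (78, False), (86, True), (95, False),
]

acc = accuracy_by_bucket(session)
print(acc)
print("fatigue suspected:", fatigue_suspected(acc))
```

With only a few gold items per bucket, a single session is weak evidence. Aggregating buckets across many sessions per annotator gives a steadier signal.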

Treating Disagreement as Annotator Error

When two competent annotators disagree, the most likely explanation is that the guidelines do not cover the case. The second most likely explanation is that the item is genuinely ambiguous. The least likely explanation is that one annotator is incompetent. Yet most teams default to the third explanation.

Disagreement is signal. It tells you where the guidelines need revision, where the evaluation task is inherently difficult, and where the quality boundary is fuzzy. Treat disagreement as information, not as noise to be suppressed.
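Treating disagreement as information can start with ranking items by how split their labels are and sending the most contested ones to guideline review. The sketch below uses Shannon entropy of each item's label distribution. The function names, the `min_entropy` cutoff, and the sample labels are assumptions for illustration.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (bits) of the labels given to one item."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def items_for_review(annotations, min_entropy=0.9):
    """Item ids with contested labels, most contested first.

    annotations: dict mapping item_id -> list of annotator labels.
    min_entropy is an assumed cutoff; tune it to your label set.
    """
    scored = [(label_entropy(labels), item_id)
              for item_id, labels in annotations.items()]
    return [item_id for entropy, item_id in sorted(scored, reverse=True)
            if entropy >= min_entropy]


# Hypothetical relevance labels from three annotators per item
annotations = {
    "resp-01": ["relevant", "relevant", "relevant"],
    "resp-02": ["relevant", "somewhat", "irrelevant"],
    "resp-03": ["relevant", "relevant", "somewhat"],
    "resp-04": ["irrelevant", "irrelevant", "irrelevant"],
}

print(items_for_review(annotations))
```

The output is a review queue, not a list of errors: each flagged item is a candidate for a guideline clarification or an explicit "ambiguous" category.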

Conflating Reliability with Validity

High inter-annotator agreement does not mean the annotations are correct. Three annotators can reliably agree on labels that are systematically wrong, because the guidelines encode the wrong definition of quality. Reliability (agreement) is necessary but not sufficient. Validity (are we measuring what we think we are measuring?) requires checking that the annotated labels actually predict the outcomes you care about: user satisfaction, task completion, retention. The trap to watch for is a maker's bias toward guidelines that produce high agreement at the cost of oversimplifying the underlying evaluation task.^R One more structural caveat: automated benchmarks like MMLU are vulnerable to data contamination because the questions live on the public web, which gives human evaluation a durable advantage that no benchmark refresh can fully replicate.^S
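A first-pass validity check compares an outcome you care about across annotated labels. The sketch below computes a task-completion rate per label. The label names, the outcome field, and the data are hypothetical, and the comparison ignores confounders and sample size, so treat it as a sanity check and not a test of validity.

```python
from collections import defaultdict


def outcome_rate_by_label(pairs):
    """Share of positive outcomes for each annotated label.

    pairs: iterable of (label, outcome) where outcome is a bool,
    for example whether the user completed their task.
    """
    totals = defaultdict(lambda: [0, 0])  # label -> [positives, count]
    for label, outcome in pairs:
        totals[label][0] += int(outcome)
        totals[label][1] += 1
    return {label: hits / n for label, (hits, n) in totals.items()}


# Hypothetical joined data: (human label, task completed?)
pairs = (
    [("helpful", True)] * 6 + [("helpful", False)] * 2
    + [("unhelpful", True)] * 2 + [("unhelpful", False)] * 4
)

rates = outcome_rate_by_label(pairs)
gap = rates["helpful"] - rates["unhelpful"]
print(rates)
print(f"completion gap (helpful - unhelpful): {gap:.3f}")
```

If responses labeled "helpful" do not show a better outcome rate than those labeled "unhelpful", the guidelines may be producing consistent labels that measure the wrong thing.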

. . .

Human evaluation is slow, expensive, and irreplaceable. No automated metric can fully substitute for a trained human annotator examining an LLM response and asking, "Would this actually help someone?" The question is not whether to do human evaluation. The question is how to structure it so that every annotation hour produces maximum insight, how to integrate it with automated systems so that humans do only what humans must do, and how to maintain quality over time as guidelines evolve, annotators turn over, and the systems being evaluated change.

Build the guidelines first, calibrate before you scale, and measure agreement relentlessly. When annotators disagree, listen to the disagreement; it is telling you something the metrics cannot.

. . .

References

Textbook grounding, chapter-level citations, and further reading for each numbered reference in this article live on the companion sources page.

1. Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). Sage Publications. The definitive reference on reliability metrics for content analysis, including the alpha coefficient that bears his name.
2. Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast, but is it good? Evaluating non-expert annotations for natural language tasks. Proceedings of EMNLP 2008. Demonstrated that aggregated crowd worker annotations can match expert quality for many NLP tasks, with appropriate quality control.
3. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36. Introduced the LLM-as-judge paradigm and compared automated judge performance to human evaluation, establishing benchmarks for judge-human agreement.
4. Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Heer, J., & Agrawala, M. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Proceedings of UIST 2024. Examined the alignment gap between automated and human evaluation, proposing methods for systematic validation of automated judges.
5. Hovy, E. & Lavid, J. (2010). Towards a 'science' of corpus annotation: A new methodological challenge for corpus linguistics. International Journal of Translation, 22(1). A foundational paper on annotation methodology, emphasizing the importance of guidelines, training, and reliability measurement.
6. Artstein, R. & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555-596. Comprehensive survey of agreement metrics, their assumptions, and appropriate use cases.
7. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46. The original paper introducing Cohen's kappa, the most widely used pairwise agreement metric.