
LLM-as-Judge: Using Models to Evaluate Models

Language models outpaced the evaluation methods built for them. Human raters are expensive and slow, and traditional metrics like BLEU and ROUGE miss what matters. The working solution most production teams converged on is to let one model grade another, with the discipline that the judge runs in a fresh session against the output presented cold so the verdict is not confirmation-biased.

The Evaluation Bottleneck

In 2020, evaluating a language model meant running it against a benchmark. BLEU for translation, ROUGE for summarization, and F1 for question answering. Each metric was imperfect, but each was automatic, reproducible, and cheap; you could evaluate a million outputs overnight. These metrics are direct descendants of the Cranfield-tradition information-retrieval measures (precision, recall, F-measure) that quantitative NLP evaluation has used since the 1960s.^A

Then the models got good enough that these metrics stopped working. A BLEU score measures n-gram overlap with a reference translation. It cannot distinguish between a fluent paraphrase and a clumsy literal rendering, because both might share the same fraction of trigrams with the gold standard. ROUGE counts shared subsequences between a generated summary and a reference. It penalizes a summary that captures every key point using different words, and rewards one that copies surface phrases while missing the meaning entirely.
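
The failure is easy to reproduce with a toy overlap metric. The sketch below is a deliberately simplified stand-in for BLEU (whitespace tokens, bigrams only, no count clipping, no brevity penalty), and the sentences and the `ngram_precision` helper are made up for illustration. It shows a reasonable paraphrase scoring zero while a scrambled copy of the reference scores high.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also appear in the reference.

    Simplified on purpose: this is not full BLEU, only the overlap idea.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = set(ngrams(reference.lower().split(), n))
    if not cand:
        return 0.0
    return sum(1 for gram in cand if gram in ref) / len(cand)


reference = "the cat sat on the mat"
paraphrase = "a cat was sitting on a rug"   # close in meaning, different wording
scrambled = "on the mat the cat sat"        # reference words, shuffled

paraphrase_score = ngram_precision(paraphrase, reference)
scrambled_score = ngram_precision(scrambled, reference)
print(f"paraphrase: {paraphrase_score:.2f}, scrambled: {scrambled_score:.2f}")
```

The paraphrase shares no bigrams with the reference, while the scrambled sentence shares four of its five, so the string-level metric ranks them in the opposite order from a human reader.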

The fundamental problem is that these metrics compare strings, not semantics. When model outputs were rough and formulaic, string comparison was a reasonable proxy for quality. When outputs became fluent, varied, and creative, the proxy broke down. A model could score poorly on BLEU while producing objectively better translations than the reference. Intrinsic metrics like perplexity have an analogous fragility: they depend on the tokenizer, so two models with different tokenization cannot be compared on the same scale.^B

Human evaluation remains the gold standard, but it has a scaling problem that gets worse every year. A single annotation takes minutes. Training annotators on rubrics takes hours. Achieving inter-annotator agreement on subjective dimensions like "helpfulness" or "safety" takes weeks of iteration. At the rate modern systems are developed, tested, and iterated, human evaluation becomes a bottleneck that slows everything downstream.

LLM-as-judge fills the gap. Send the output to a strong model along with evaluation criteria, and get a structured judgment back. It is not as reliable as a careful human annotator. It is orders of magnitude faster, cheaper, and more consistent. For most practical purposes, that tradeoff is decisive.

The Core Pattern

The basic LLM-as-judge pattern is disarmingly simple. You construct a prompt that contains the output to evaluate, the criteria for evaluation, and instructions for how to structure the judgment. You send this prompt to a capable model. You parse the response.

Here is the minimal version:

from openai import OpenAI
import json

client = OpenAI()

def judge(output, criteria):
    """Send an output to an LLM judge and get a score back."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "You are an expert evaluator. Score the following "
                "output on a scale of 1 to 5 based on the given "
                "criteria. Return JSON with 'score' (integer 1-5) "
                "and 'reasoning' (string)."
            )
        }, {
            "role": "user",
            "content": (
                f"Criteria: {criteria}\n\n"
                f"Output to evaluate:\n{output}"
            )
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

# Example usage
result = judge(
    output="Photosynthesis converts sunlight into chemical energy. "
           "Plants absorb CO2 and release oxygen as a byproduct.",
    criteria="Accuracy and completeness of the scientific explanation. "
             "A score of 5 means fully correct and comprehensive. "
             "A score of 1 means mostly incorrect or missing key details.",
)
print(f"Score: {result['score']}/5")
print(f"Reasoning: {result['reasoning']}")

That is the entire pattern. Everything else in this article is about making it work reliably: designing better rubrics, choosing scoring strategies, mitigating biases, calibrating against human judgments, and building this into an automated pipeline.

The same pattern works with Anthropic's API:

import json

from anthropic import Anthropic

client = Anthropic()

def judge_with_claude(output, criteria):
    """Use Claude as an LLM judge."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=(
            "You are an expert evaluator. Score the following "
            "output on a scale of 1 to 5 based on the given "
            "criteria. Return JSON with 'score' (integer 1-5) "
            "and 'reasoning' (string). Return only valid JSON."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Criteria: {criteria}\n\n"
                f"Output to evaluate:\n{output}"
            )
        }],
    )
    return json.loads(response.content[0].text)

Why It Works

Using an LLM to evaluate another LLM's output sounds circular. It is not, for two reasons.

First, there is a fundamental asymmetry between generation and evaluation. Generating a correct, well-structured response to an open-ended question is hard. Checking whether a given response meets specific criteria is much easier. This mirrors a well-known principle in computer science: verification is easier than generation. An LLM that might hallucinate during open-ended generation can still reliably assess whether a given text is coherent, relevant, or factually grounded in a provided reference.

Second, modern LLMs are specifically trained on human preference data through RLHF and similar techniques. They have internalized millions of human quality judgments. When you ask a model "which of these two responses is better?", you are asking it to perform exactly the kind of comparative assessment that constituted a significant portion of its training signal. The model is not improvising a theory of quality. It is pattern-matching against a vast corpus of human preferences. The mechanism is formal: a reward model trained on preference data learns a scalar function over prompt-output pairs, and that signal becomes part of the aligned model's behavior.^C The data requirement is surprisingly modest: on the order of tens of megabytes of high-quality preference pairs is enough to transform a base model into a competent instruction-follower and, by extension, a competent judge.^D

The empirical evidence is strong. A capable frontier model achieves over 80% agreement with human expert judges on open-ended quality assessments, comparable to the agreement rate between different human annotators.¹ The judges are not perfect, but they are consistent and directionally correct, which is what matters for most evaluation use cases.

The key insight is this: you do not need the judge to be right about every individual output. You need it to be right about the aggregate. If your system produces a thousand responses and the judge correctly identifies 820 of them as good or bad, you have a reliable signal for tracking quality over time, comparing system versions, and catching regressions. As long as the judge's mistakes are not skewed toward one system or one kind of output, the 18% error rate largely washes out in aggregate.
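
A small calculation makes the aggregate argument concrete. The sketch below assumes a simplified noise model that is not stated in the article: the judge is right 82% of the time on good and bad outputs alike, and its errors do not depend on which system produced the output. The pass rates of 70% and 80% are invented for illustration.

```python
def observed_pass_rate(true_rate, judge_accuracy):
    """Pass rate a noisy judge reports for a system with a given true pass rate.

    Assumes symmetric errors: the judge is equally accurate on good
    and bad outputs. Real judges may not satisfy this.
    """
    correct_passes = true_rate * judge_accuracy            # good outputs marked good
    false_passes = (1 - true_rate) * (1 - judge_accuracy)  # bad outputs marked good
    return correct_passes + false_passes


old_version = observed_pass_rate(0.70, 0.82)
new_version = observed_pass_rate(0.80, 0.82)
measured_gap = new_version - old_version

print(f"old: {old_version:.3f}, new: {new_version:.3f}, gap: {measured_gap:.3f}")
```

Under these assumptions the true 10-point gap shrinks to about 6.4 points (a factor of 2 × 0.82 − 1 = 0.64), but its direction is preserved, which is what version comparisons and regression checks need. Systematic judge biases break the symmetry assumption and do not cancel this way.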

Rubric Design

The rubric is the prompt. This is the single most important idea in LLM-as-judge, and the one most often neglected. Teams that invest heavily in model selection and scoring strategy while treating the rubric as an afterthought are optimizing the wrong thing.

A vague rubric produces noisy, unreliable scores. Consider this criterion:

"Rate the quality of this response from 1 to 5."

What does "quality" mean? Accuracy? Fluency? Helpfulness? Conciseness? The judge model will impose its own interpretation, and that interpretation will vary with context, phrasing, and even the specific examples in the batch. Two evaluations of the same output might produce different scores simply because the model interpreted "quality" differently each time.

Now compare this rubric:

"Rate the factual accuracy of this response on a scale of 1 to 5, where:
5 = Every factual claim is correct and verifiable
4 = All major claims are correct; minor details may be imprecise
3 = The core answer is correct but contains at least one significant error
2 = The response contains multiple factual errors that undermine its usefulness
1 = The response is predominantly incorrect or fabricated"

This rubric is specific, observable, and anchored with a description at each score level. The judge no longer has to decide what "quality" means; it counts errors and assesses their severity against the rubric. The result is dramatically more consistent scores.

Properties of Good Rubrics

Good rubrics share four properties. They are specific, targeting a single observable dimension rather than a vague composite. They are anchored, providing concrete descriptions of what each score level looks like. They are behavioral, describing things that can be observed in the text rather than inferred about the author's intent. And they are exhaustive, covering the full range of possible quality levels so the judge never has to extrapolate. These are the same properties that distinguish reliable human annotation guidelines: specificity, anchoring, behavioral descriptions, and exhaustive edge-case coverage are what make crowdworker annotations consistent enough to use as training data for the very models being evaluated.^E

Here is a rubric for evaluating the helpfulness of a customer support response:

RUBRIC: Customer Support Response Helpfulness

Score the response on a scale of 1 to 5:

5 - Directly addresses the customer's specific issue.
    Provides a clear, actionable solution or next step.
    Uses appropriate tone (empathetic, professional).
    Example: "I see the charge on your account from Oct 15.
    I've issued a refund of $24.99 which will appear in 3-5
    business days. Is there anything else I can help with?"

4 - Addresses the issue and provides a solution, but may
    require a minor follow-up for clarification.
    Example: "I can help with that refund. Could you confirm
    the transaction date so I can process it right away?"

3 - Partially addresses the issue but misses key details,
    or provides a generic rather than specific response.
    Example: "I understand your concern about the charge.
    Let me look into this for you and get back to you."

2 - Acknowledges the customer's message but fails to move
    toward a resolution. Overly generic or off-topic.
    Example: "Thank you for reaching out. We value your
    feedback and strive to provide the best experience."

1 - Does not address the customer's issue. May be
    irrelevant, confusing, or inappropriate in tone.
    Example: "Please see our FAQ at example.com/faq."

Notice how each score level includes a concrete example. The judge can compare the actual response against these anchors and place it accordingly. This eliminates most of the subjectivity that plagues unanchored scales.
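
One way to keep anchored rubrics consistent is to store them as data and render the prompt text from that. The sketch below is one possible layout, not a standard format: the `render_rubric` helper and the abbreviated level descriptions are hypothetical, and the check for missing levels enforces the "exhaustive" property described above.

```python
def render_rubric(title, levels):
    """Render {score: (description, example)} as rubric text, highest score first.

    Raises ValueError if any score from 1 to 5 has no anchor.
    """
    missing = [score for score in range(1, 6) if score not in levels]
    if missing:
        raise ValueError(f"Rubric is missing anchors for scores: {missing}")

    lines = [f"RUBRIC: {title}", "", "Score the response on a scale of 1 to 5:"]
    for score in sorted(levels, reverse=True):
        description, example = levels[score]
        lines += ["", f"{score} - {description}", f'    Example: "{example}"']
    return "\n".join(lines)


# Abbreviated, illustrative anchors (not the full rubric from the article).
support_levels = {
    5: ("Directly resolves the specific issue.", "I've issued a refund of $24.99."),
    4: ("Offers a solution but needs a follow-up.", "Could you confirm the date?"),
    3: ("Partially addresses the issue.", "Let me look into this for you."),
    2: ("Acknowledges but does not resolve.", "Thank you for reaching out."),
    1: ("Does not address the issue.", "Please see our FAQ."),
}

rubric_text = render_rubric("Customer Support Response Helpfulness", support_levels)
print(rubric_text)
```

The rendered string can be passed as the rubric argument to any of the judge functions in this article.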

Scoring Strategies

There are three fundamental approaches to LLM-as-judge scoring: pointwise, pairwise, and reference-based. Each has distinct strengths, and the right choice depends on what question you are trying to answer.

Pointwise Scoring

Pointwise scoring rates each response independently on a numeric scale. This is the most common approach and the one shown in the examples above. You give the judge one output and ask for a score.

Strengths: simple to implement, easy to aggregate, and produces absolute scores that can be tracked over time. If your system scored 3.8 last month and 4.1 this month, you know it improved.

Weaknesses: LLM judges exhibit scale compression. They tend to cluster scores in the middle of the range, avoiding extreme ratings. A 1-5 scale often collapses into a de facto 3-4 scale. This makes it hard to distinguish between mediocre and good outputs. You can mitigate this with detailed rubric anchoring, but it never fully goes away.

def pointwise_judge(output, rubric):
    """Rate a single output on a 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "You are an expert evaluator. Use the rubric below "
                "to score the output. First, reason step by step "
                "about which score level best matches the output. "
                "Then provide your final score.\n\n"
                "Return JSON: {'reasoning': '...', 'score': N}\n\n"
                f"RUBRIC:\n{rubric}"
            )
        }, {
            "role": "user",
            "content": f"Output to evaluate:\n{output}"
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
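
Scale compression is easy to check for once you have a batch of pointwise scores. The sketch below computes the score distribution and flags a batch when most scores land on 3 or 4. The 80% threshold and the sample scores are arbitrary assumptions chosen for illustration, not recommended values.

```python
from collections import Counter


def score_distribution(scores, low=1, high=5):
    """Share of judge scores at each level of the scale."""
    if not scores:
        raise ValueError("Need at least one score.")
    counts = Counter(scores)
    return {level: counts.get(level, 0) / len(scores)
            for level in range(low, high + 1)}


def is_compressed(scores, middle=(3, 4), threshold=0.8):
    """True when at least `threshold` of the scores fall in the middle band."""
    dist = score_distribution(scores)
    return sum(dist[level] for level in middle) >= threshold


sample_scores = [3, 4, 4, 3, 4, 3, 4, 4, 3, 5]  # made-up judge scores
dist = score_distribution(sample_scores)
compressed = is_compressed(sample_scores)
print(dist, compressed)
```

A flagged batch is not proof of a problem, since the outputs may genuinely be mediocre, but it is a cue to compare against human scores on the same examples.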

Pairwise Comparison

Pairwise comparison shows the judge two outputs and asks which one is better. Instead of assigning an absolute score, the judge makes a relative judgment. This is the format behind public benchmarks like Chatbot Arena and the AlpacaEval family, and the same format is widely used in production A/B-testing pipelines.⁵

Strengths: humans are naturally better at comparison than absolute rating, and LLMs inherit this property. Pairwise judgments are more consistent than pointwise scores. They also avoid the scale compression problem entirely, since there is no scale. The mathematical reason pairwise comparison works is that it reduces a hard absolute-judgment problem to an easier ordinal one: the Bradley-Terry model can convert binary "A is better than B" preferences into cardinal quality scores without anyone ever having to assign a number to a response.^F

Weaknesses: pairwise comparison does not produce absolute scores, so you cannot directly track quality over time. It also scales quadratically. Comparing N systems requires N(N-1)/2 pairwise evaluations per test case. For A/B testing between two system versions, this is fine. For ranking ten candidate systems, it becomes expensive.

def pairwise_judge(question, response_a, response_b, criteria):
    """Compare two responses and pick the better one."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "You are an expert evaluator. You will be shown a "
                "question and two responses (A and B). Determine "
                "which response is better based on the criteria.\n\n"
                "Think step by step. Then provide your verdict.\n\n"
                "Return JSON: {'reasoning': '...', "
                "'winner': 'A' or 'B' or 'tie'}\n\n"
                f"Criteria: {criteria}"
            )
        }, {
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Response A:\n{response_a}\n\n"
                f"Response B:\n{response_b}"
            )
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

# Example: comparing two summarization approaches
result = pairwise_judge(
    question="Summarize the key findings of the Q3 report.",
    response_a="Revenue increased 12% to $4.2B. Margins improved.",
    response_b="The Q3 report showed strong results across all "
               "metrics, with particularly notable improvements "
               "in the enterprise segment driving growth.",
    criteria="Specificity, accuracy, and informativeness.",
)
print(f"Winner: {result['winner']}")
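
To turn a pile of pairwise verdicts into per-system scores, the Bradley-Terry model mentioned above can be fit with a few lines of code. The sketch below uses the standard minorization-maximization update and assumes ties have already been dropped. The `bradley_terry` function name and the win counts are illustrative, and a production pipeline would likely use a tested statistics library instead.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict of strengths that sum to 1; a higher value means the
    system wins more often under the model.
    """
    if not comparisons:
        raise ValueError("Need at least one (winner, loser) pair.")

    wins = defaultdict(float)   # total wins per system
    games = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0        # register systems that never win
        games[frozenset((winner, loser))] += 1

    systems = list(wins)
    strength = {s: 1.0 / len(systems) for s in systems}
    for _ in range(iterations):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                n = games.get(frozenset((i, j)), 0.0)
                if j != i and n > 0:
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else 0.0
        total = sum(updated.values())
        strength = {s: value / total for s, value in updated.items()}
    return strength


# Made-up verdicts for three hypothetical systems.
outcomes = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
strengths = bradley_terry(outcomes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking, strengths)
```

The fitted strengths give a cardinal scale: under the model, the probability that system i beats system j is strength[i] / (strength[i] + strength[j]).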

Reference-Based Scoring

Reference-based scoring compares the output against a gold-standard reference answer. The judge assesses how well the output covers the same information as the reference, using the reference as an anchor rather than relying solely on the rubric.

Strengths: this is the most reliable approach when you have high-quality reference answers. It grounds the evaluation in concrete expected content rather than abstract criteria. It is especially useful for factual tasks where there is a clearly correct answer.

Weaknesses: you need reference answers, which are expensive to create. And the approach penalizes valid alternative formulations. A response that is correct but structured differently from the reference may be scored lower than it deserves.

def reference_judge(question, output, reference):
    """Score an output by comparing it to a reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "You are an expert evaluator. Compare the output "
                "to the reference answer and score it 1-5.\n\n"
                "5 = Covers all key points from the reference, "
                "with no significant errors or omissions.\n"
                "4 = Covers most key points. Minor omissions.\n"
                "3 = Covers some key points but misses important "
                "information from the reference.\n"
                "2 = Misses most key points or contains errors.\n"
                "1 = Does not match the reference in content or "
                "contains significant factual errors.\n\n"
                "The output does NOT need to use the same wording. "
                "Evaluate semantic coverage, not surface similarity.\n\n"
                "Return JSON: {'reasoning': '...', 'score': N, "
                "'covered_points': [...], 'missed_points': [...]}"
            )
        }, {
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Reference answer:\n{reference}\n\n"
                f"Output to evaluate:\n{output}"
            )
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

Choosing a Strategy

Three scoring strategies side by side.

In practice, many evaluation pipelines combine strategies. Use pairwise comparison for system-level decisions (which model version is better?) and pointwise scoring for per-example diagnostics (which specific outputs need improvement?). Use reference-based scoring for your golden test set where you have curated answers.

G-Eval and Chain-of-Thought Judging

G-Eval is a technique that significantly improves LLM judge quality through a simple modification: ask the judge to think before scoring.²

The standard approach sends the output and criteria to the judge and asks for a score. G-Eval adds an intermediate step. First, the judge generates a chain-of-thought reasoning about how the criteria apply to the specific output. Then, conditioned on that reasoning, it produces a score. This mirrors how a careful human evaluator would work: read the rubric, examine the output, think about how the rubric applies, then assign a number. The underlying mechanism is the same one that powers chain-of-thought prompting in general: forcing a model to decompose a problem into intermediate steps before committing to an answer.^G

The improvement is substantial. G-Eval with a frontier judge model achieves higher correlation with human judgments than any earlier automatic evaluation method, including supervised metrics trained specifically for the evaluation task. The chain-of-thought forces the model to engage with the specific content of the output rather than producing a gut-reaction score.

The implementation requires only a prompt change:

def geval_judge(output, criteria, evaluation_steps):
    """G-Eval: Chain-of-thought evaluation for better scores."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "You are an expert evaluator. You will evaluate "
                "the given output using specific evaluation steps.\n\n"
                f"CRITERIA: {criteria}\n\n"
                "EVALUATION STEPS:\n"
                + "\n".join(
                    f"{i+1}. {step}"
                    for i, step in enumerate(evaluation_steps)
                )
                + "\n\nFollow these steps carefully. Write your "
                "detailed analysis for each step, then provide "
                "a final score from 1 to 5.\n\n"
                "Return JSON: {'step_analysis': ['...', '...'], "
                "'score': N}"
            )
        }, {
            "role": "user",
            "content": f"Output to evaluate:\n{output}"
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

# Example: G-Eval for summarization quality
result = geval_judge(
    output="The study found that exercise improves memory "
           "and reduces anxiety in adults over 65.",
    criteria="Coherence and informativeness of the summary.",
    evaluation_steps=[
        "Check if the summary captures the main finding.",
        "Check if the summary includes the study population.",
        "Check if the summary is free of unsupported claims.",
        "Assess whether the summary is concise but complete.",
        "Assign a score based on overall quality (1-5).",
    ],
)
print(f"Score: {result['score']}")
for i, analysis in enumerate(result["step_analysis"]):
    print(f"Step {i+1}: {analysis}")

The evaluation steps are the critical design decision. Each step should target a specific, observable aspect of quality. Steps that are too broad ("Assess the overall quality") provide no benefit over a simple prompt. Steps that are too narrow ("Count the number of sentences") turn the evaluation into a mechanical checklist that misses holistic quality. The sweet spot is steps that direct attention without constraining judgment.

Position Bias and Other Failure Modes

LLM judges are not neutral. They carry systematic biases that, if unaddressed, can corrupt your evaluation results. Understanding these biases is essential for building trustworthy evaluation pipelines.

Position Bias

In pairwise comparisons, LLM judges disproportionately prefer whichever response appears first. This is known as position bias. The effect has been documented across multiple judge models: when the same pair of responses is presented in both orders, the judge's preference changes in a significant fraction of cases, with the first-position advantage ranging from a few percentage points to over twenty percentage points depending on model and content similarity.¹ The effect may run deeper than a prompt artifact: language models have long encoded sequence-order preferences ("salt and pepper" reliably beating "pepper and salt"), which suggests judge position bias is partly a structural property of how the underlying model represents order.^H

The mitigation is straightforward. Run every pairwise comparison twice, once with A first and once with B first. If the judge picks the same winner in both orderings, the judgment is reliable. If the winner changes, mark the comparison as a tie or discard it. This doubles your cost but eliminates one of the largest sources of noise.

Swap-and-reconcile cancels position bias.

def debiased_pairwise_judge(question, response_a, response_b, criteria):
    """Run pairwise comparison in both orders to cancel position bias."""

    # First ordering: A then B
    result_ab = pairwise_judge(question, response_a, response_b, criteria)

    # Second ordering: B then A
    result_ba = pairwise_judge(question, response_b, response_a, criteria)

    # Map result_ba winner back to original labels
    ba_winner_mapped = {
        "A": "B",  # Judge picked "A" = response_b
        "B": "A",  # Judge picked "B" = response_a
        "tie": "tie",
    }[result_ba["winner"]]

    # Check for consistency
    if result_ab["winner"] == ba_winner_mapped:
        return {
            "winner": result_ab["winner"],
            "confident": True,
            "reasoning_ab": result_ab["reasoning"],
            "reasoning_ba": result_ba["reasoning"],
        }
    else:
        return {
            "winner": "tie",
            "confident": False,
            "note": "Position bias detected; results inconsistent.",
        }
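
Once every comparison runs in both orders, the share of inconsistent verdicts is itself a useful health metric for the judge. The sketch below assumes result dictionaries shaped like the return value of `debiased_pairwise_judge` above, with a boolean `confident` key. The `position_flip_rate` helper and the sample results are invented for illustration.

```python
def position_flip_rate(results):
    """Share of comparisons whose winner changed when the order was swapped."""
    if not results:
        raise ValueError("Need at least one comparison result.")
    flipped = sum(1 for result in results if not result["confident"])
    return flipped / len(results)


# Made-up results in the shape returned by debiased_pairwise_judge.
sample_results = [
    {"winner": "A", "confident": True},
    {"winner": "tie", "confident": False},
    {"winner": "B", "confident": True},
    {"winner": "A", "confident": True},
]
flip_rate = position_flip_rate(sample_results)
print(f"Flip rate: {flip_rate:.0%}")
```

Tracking this number across judge models or prompt revisions shows whether a change made the judge more or less sensitive to ordering.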

Verbosity Bias

Verbosity bias causes LLM judges to prefer longer responses, even when the additional length adds no substance. A two-paragraph answer that repeats itself will often outscore a concise one-paragraph answer that says the same thing more efficiently. This bias has been quantified on AlpacaEval, where a non-trivial share of "wins" is attributable to length alone, and a length-controlled correction shifts the rankings meaningfully.⁴ The bias is particularly dangerous because it incentivizes exactly the wrong behavior: training or tuning your system to be verbose rather than precise. The likely cause is reward-hacking from RLHF training itself, where longer responses accumulated more positive preference signals and a length-quality association became baked into the model's reward function.^I

Mitigation: include explicit instructions in your rubric that length should not influence the score. Better yet, include rubric anchors where a concise response scores higher than a verbose one. You can also add a separate "conciseness" dimension to your evaluation and penalize outputs that exceed a reasonable length for the task.
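
Before trusting a mitigation, it helps to measure how often length and verdict line up. The sketch below is a rough diagnostic, not the length-controlled correction used by AlpacaEval: it counts how often the longer response wins among decisive comparisons, using word count as a crude length proxy. The function name and sample data are hypothetical.

```python
def longer_response_win_rate(comparisons):
    """Among decisive comparisons, how often did the longer response win?

    Each item is (response_a, response_b, winner), winner in {"A", "B", "tie"}.
    Ties and equal-length pairs are skipped. Returns None if nothing is left.
    """
    decisive = 0
    longer_wins = 0
    for response_a, response_b, winner in comparisons:
        len_a, len_b = len(response_a.split()), len(response_b.split())
        if winner == "tie" or len_a == len_b:
            continue
        decisive += 1
        longer = "A" if len_a > len_b else "B"
        if winner == longer:
            longer_wins += 1
    return longer_wins / decisive if decisive else None


# Made-up judged pairs.
judged = [
    ("Paris.",
     "The capital of France is Paris, a city in France that is the capital.",
     "B"),
    ("Use a hash map.",
     "You could perhaps consider using some kind of list structure.",
     "A"),
    ("Yes.", "Yes, it is, and to reiterate, yes it is.", "B"),
    ("Same length here", "Also three words", "A"),
]
rate = longer_response_win_rate(judged)
print(f"Longer response won {rate:.0%} of decisive comparisons")
```

A rate well above 50% does not prove bias on its own, because longer answers are sometimes genuinely better, but it identifies the pairs worth reviewing by hand.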

Self-Enhancement Bias

Models tend to rate their own outputs more favorably than outputs from other models. This self-enhancement bias means that if you use a model to judge outputs from its own model family, the scores will be systematically inflated compared to what the same judge gives a different model's outputs. This bias is smaller than position bias or verbosity bias, but it is measurable. A plausible deeper cause is sycophancy: assistant-trained models tend to agree with viewpoints presented to them, and an LLM judge inherits that tendency when assessing outputs.^J

The practical implication: when comparing outputs from different models, use a third model as the judge. Or use multiple judges and aggregate their scores. Either approach reduces the influence of any single model's self-preference.
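
Aggregating across judges can be as simple as taking a median and leaving out the judge that shares a model family with the system under test. The sketch below assumes each judge has already produced a 1-5 score; the judge names and the `aggregate_scores` helper are placeholders, not real model identifiers.

```python
from statistics import median


def aggregate_scores(scores_by_judge, exclude=None):
    """Median score across judges, optionally excluding one judge by name."""
    usable = [score for judge, score in scores_by_judge.items()
              if judge != exclude]
    if not usable:
        raise ValueError("No judges left after exclusion.")
    return median(usable)


# Hypothetical scores for one output from three judges.
scores = {"judge_x": 5, "judge_y": 4, "judge_z": 3}

all_judges = aggregate_scores(scores)
# Suppose the output came from the same model family as judge_x.
without_self = aggregate_scores(scores, exclude="judge_x")
print(all_judges, without_self)
```

The median is used here because a single inflated score moves it less than it moves a mean; other aggregation rules are equally defensible.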

Confident-Sounding Wrong Answers

LLM judges struggle to distinguish between confidently stated correct information and confidently stated incorrect information. A response that asserts a false claim with authority and supporting detail will often receive a higher score than one that hedges appropriately around the edges of its knowledge. This failure mode has been demonstrated dramatically: a Galactica-style model can generate a fluent, authoritative-looking scientific abstract claiming an effective treatment that does not exist, and a judge evaluating the abstract on surface fluency will not catch it.^K

This is the most dangerous bias, because it rewards the exact failure mode that matters most in production: confident hallucination. The mitigation is to include factual verification as an explicit evaluation step. Provide ground truth or reference material. Do not rely on the judge's own knowledge to assess factual accuracy.

Multi-Dimensional Evaluation

Real-world outputs rarely need evaluation on a single axis. A customer support response needs to be accurate, empathetic, actionable, and safe. A code generation output needs to be correct, efficient, readable, and well-documented. Collapsing all of these dimensions into a single score loses the diagnostic information that makes evaluation useful. The industry standard for preference data now reflects this: production preference datasets rate outputs on independent Likert scales for distinct aspects such as helpfulness, honesty, correctness, complexity, and verbosity, rather than a single composite score.^L

Multi-dimensional evaluation scores each output on several independent criteria, producing a vector of scores rather than a scalar. This costs more per evaluation (one LLM call per dimension, or one structured call that assesses all dimensions at once), but the diagnostic value is worth it.

def multi_dimensional_judge(question, output, dimensions):
    """Evaluate an output on multiple independent dimensions."""

    # Build dimension descriptions for the prompt
    dim_text = "\n".join(
        f"- {name}: {desc} (score 1-5)"
        for name, desc in dimensions.items()
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "You are an expert evaluator. Score the output "
                "on each dimension independently. For each dimension, "
                "first explain your reasoning, then assign a score.\n\n"
                f"DIMENSIONS:\n{dim_text}\n\n"
                "Return JSON with this structure:\n"
                "{\n"
                "  'evaluations': {\n"
                "    'dimension_name': {\n"
                "      'reasoning': '...',\n"
                "      'score': N\n"
                "    },\n"
                "    ...\n"
                "  }\n"
                "}"
            )
        }, {
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Output to evaluate:\n{output}"
            )
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

# Example: evaluating a code generation output
code_dimensions = {
    "correctness": "Does the code produce the correct output for "
                   "the given specification? Are edge cases handled?",
    "efficiency": "Is the algorithmic complexity reasonable? Does "
                  "the code avoid unnecessary operations?",
    "readability": "Is the code well-structured with clear variable "
                   "names, appropriate comments, and logical flow?",
    "robustness": "Does the code handle invalid inputs, empty cases, "
                  "and boundary conditions gracefully?",
}

result = multi_dimensional_judge(
    question="Write a function to find the longest palindrome substring.",
    output="def longest_palindrome(s):\n    if not s: return ''\n    ...",
    dimensions=code_dimensions,
)

# Print per-dimension results
for dim, eval_result in result["evaluations"].items():
    print(f"{dim}: {eval_result['score']}/5")
    print(f"  {eval_result['reasoning']}\n")

The per-dimension breakdown transforms evaluation from a pass/fail signal into an actionable diagnostic. If correctness is high but readability is low, you know to adjust your prompt to encourage clearer code. If robustness is consistently the weakest dimension, you know to add explicit instructions about error handling. A single aggregate score would hide these patterns entirely.
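
Spotting the consistently weakest dimension requires averaging across many evaluated outputs. The sketch below assumes a list of results in the shape returned by `multi_dimensional_judge` above; the helper names and the sample batch are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def dimension_means(results):
    """Average each dimension's score across many multi-dimensional results."""
    scores = defaultdict(list)
    for result in results:
        for name, evaluation in result["evaluations"].items():
            scores[name].append(evaluation["score"])
    return {name: mean(values) for name, values in scores.items()}


def weakest_dimension(results):
    """Name of the dimension with the lowest average score."""
    means = dimension_means(results)
    return min(means, key=means.get)


# Made-up batch of judged outputs.
batch = [
    {"evaluations": {"correctness": {"score": 5}, "robustness": {"score": 2}}},
    {"evaluations": {"correctness": {"score": 4}, "robustness": {"score": 3}}},
    {"evaluations": {"correctness": {"score": 4}, "robustness": {"score": 2}}},
]
means = dimension_means(batch)
weakest = weakest_dimension(batch)
print(means, weakest)
```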

Calibration: Anchoring Judge Scores

An LLM judge that produces consistent scores is useful. An LLM judge whose scores align with human judgment is trustworthy. The difference between the two is calibration. The methodology for closing this gap (holding out human-graded examples, computing judge-human agreement, and iterating on the rubric until the agreement metrics cross an acceptable threshold) has been formalized in recent work on validating automated evaluators.³

Calibration requires a small set of human-graded examples: outputs that have been scored by human experts using the same rubric you give to the LLM judge. You run the judge on these same examples and compare its scores to the human scores. The agreement between the two tells you how much to trust the judge's evaluations on unlabeled data. The same preference-pair training data that aligns generator models can also calibrate judges, since the underlying signal (accepted-versus-rejected response pairs) is the same on both sides of the pipeline.^M

Measuring Agreement

Two metrics are standard for measuring judge calibration. Spearman's rank correlation measures whether the judge ranks outputs in the same order as humans. A correlation of 0.8 or above indicates strong alignment. Cohen's kappa measures agreement on categorical judgments (e.g., pass/fail, A/B/tie), adjusted for chance agreement. A kappa of 0.6 or above is generally considered acceptable; 0.8 or above is strong.

from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score
import numpy as np

def calibrate_judge(human_scores, judge_scores):
    """Measure alignment between human and LLM judge scores."""

    # Spearman rank correlation
    correlation, p_value = spearmanr(human_scores, judge_scores)

    # Cohen's kappa (discretize scores into bins for kappa)
    human_bins = ["low" if s <= 2 else "mid" if s <= 3 else "high"
                  for s in human_scores]
    judge_bins = ["low" if s <= 2 else "mid" if s <= 3 else "high"
                  for s in judge_scores]
    kappa = cohen_kappa_score(human_bins, judge_bins)

    # Mean absolute error
    mae = np.mean(np.abs(
        np.array(human_scores) - np.array(judge_scores)
    ))

    return {
        "spearman_correlation": round(correlation, 4),
        "p_value": round(p_value, 6),
        "cohens_kappa": round(kappa, 4),
        "mean_absolute_error": round(mae, 4),
    }

# Example: check calibration on 20 human-graded examples
human = [5, 4, 3, 5, 2, 4, 1, 3, 4, 5,
         2, 3, 4, 5, 1, 4, 3, 2, 5, 4]
llm    = [5, 4, 3, 4, 3, 4, 2, 3, 4, 5,
         2, 4, 4, 5, 1, 3, 3, 2, 4, 4]

calibration = calibrate_judge(human, llm)
print(f"Spearman correlation: {calibration['spearman_correlation']}")
print(f"Cohen's kappa: {calibration['cohens_kappa']}")
print(f"Mean absolute error: {calibration['mean_absolute_error']}")

When Calibration Fails

Low agreement between the judge and human annotators is usually a rubric problem rather than a model problem: the rubric is ambiguous enough that reasonable evaluators interpret it differently. The first fix is to refine the rubric: add more specific anchors, clarify edge cases, provide additional examples. That said, the rubric-first framing has limits. A maker's bias can lead teams to blame the rubric when the underlying judge model genuinely lacks the capacity to evaluate certain dimensions, so keep that possibility on the table.^O

A useful debugging technique is to examine the specific examples where the judge and humans diverge. If the judge consistently scores higher than humans on a particular type of output, the rubric probably has a gap that the judge is interpreting more generously. If divergence is random, the rubric is simply too vague to produce consistent judgments from any evaluator, human or machine. Watch for one more failure mode: if the judge was trained on data that overlaps with the calibration set, judge-human agreement metrics will be inflated by contamination rather than by genuine alignment.^N
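One way to separate systematic bias from random divergence is to group the judge-minus-human gaps by output type. This is a sketch under stated assumptions: the `category` field, the record layout, and `divergence_by_category` are hypothetical and not part of the calibration code above.

```python
from collections import defaultdict
from statistics import mean


def divergence_by_category(records):
    """Summarize judge-minus-human score gaps per output category.

    A mean signed gap far from zero suggests a systematic bias.
    A large absolute gap with a signed gap near zero suggests noise.
    """
    gaps = defaultdict(list)
    for rec in records:
        gaps[rec["category"]].append(rec["judge"] - rec["human"])
    return {
        cat: {
            "mean_signed_gap": round(mean(g), 2),
            "mean_abs_gap": round(mean([abs(x) for x in g]), 2),
            "n": len(g),
        }
        for cat, g in gaps.items()
    }


# Hypothetical calibration records, for illustration only
records = [
    {"category": "long_answer", "human": 3, "judge": 4},
    {"category": "long_answer", "human": 2, "judge": 4},
    {"category": "short_answer", "human": 4, "judge": 3},
    {"category": "short_answer", "human": 3, "judge": 4},
]
summary = divergence_by_category(records)
for category, stats in summary.items():
    print(category, stats)
```

In this made-up data the judge is consistently generous on long answers (signed gap equals absolute gap), while its short-answer disagreements cancel out, which points to noise rather than bias.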

Aim to calibrate with at least 50 human-graded examples. Fewer than that and your agreement metrics will have wide confidence intervals that make it hard to distinguish real misalignment from sampling noise. Recalibrate whenever you change the rubric, switch judge models, or significantly modify the system being evaluated.
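To see how wide those confidence intervals are for a given sample, a percentile bootstrap over the human-graded examples gives a rough estimate. The sketch below uses only the standard library so it is self-contained; `average_ranks`, `spearman`, and `bootstrap_spearman_ci` are illustrative helpers, and the resample count and seed are arbitrary choices rather than recommendations.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def bootstrap_spearman_ci(human, judge, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Spearman correlation."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        rho = spearman([human[i] for i in idx], [judge[i] for i in idx])
        if rho is not None:
            stats.append(rho)
    stats.sort()
    low = stats[int((alpha / 2) * len(stats))]
    high = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return low, high


# The 20 human-graded examples from the calibration example above
human = [5, 4, 3, 5, 2, 4, 1, 3, 4, 5, 2, 3, 4, 5, 1, 4, 3, 2, 5, 4]
llm = [5, 4, 3, 4, 3, 4, 2, 3, 4, 5, 2, 4, 4, 5, 1, 3, 3, 2, 4, 4]

low, high = bootstrap_spearman_ci(human, llm)
print(f"Spearman: {spearman(human, llm):.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```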

The Economics

LLM-as-judge is cheaper than human evaluation, but it is not free. At scale, the cost of evaluation can rival the cost of the system being evaluated. The real expense is not the alignment training that makes the judge possible (which is a small fraction of base-model training cost), but the inference cost of running the judge across thousands of evaluations.^P Understanding the cost structure helps you make smart decisions about when and how to evaluate.

Cost per Evaluation

A typical pointwise evaluation with a detailed rubric consumes roughly 500-1,000 input tokens (rubric + output) and 200-500 output tokens (reasoning + score). With chain-of-thought evaluation, output tokens increase to 500-1,500.

Same workload, 20x cost spread.

The figures are approximate and will vary with rubric length and output complexity. The key observation: frontier models cost roughly 20x more than their smaller variants for the same workload. For a pipeline evaluating 10,000 outputs per day across four dimensions, that is the gap between $640/day and $28/day.
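The arithmetic behind a daily estimate is simple enough to keep in a helper. This sketch assumes one judge call per output per dimension, and the per-million-token prices are placeholders, not real vendor pricing; `daily_judge_cost` is a hypothetical name.

```python
def daily_judge_cost(outputs_per_day, dimensions, input_tokens, output_tokens,
                     usd_per_million_input, usd_per_million_output):
    """Estimate daily judge spend, assuming one call per output per dimension."""
    calls = outputs_per_day * dimensions
    cost_per_call = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return calls * cost_per_call


# Placeholder prices, NOT real vendor pricing: substitute current rates.
large = daily_judge_cost(10_000, 4, 1_000, 500,
                         usd_per_million_input=2.0, usd_per_million_output=8.0)
small = daily_judge_cost(10_000, 4, 1_000, 500,
                         usd_per_million_input=0.1, usd_per_million_output=0.4)
print(f"large judge: ${large:,.2f}/day, small judge: ${small:,.2f}/day")
```

Because output tokens are typically priced higher than input tokens, chain-of-thought reasoning length tends to dominate the per-call cost in a calculation like this.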

When to Use a Cheaper Judge

Not every evaluation needs GPT-4o or Claude Sonnet. Smaller, cheaper models are often sufficient for well-defined evaluation tasks. The rule of thumb: the vaguer the rubric and the more subjective the judgment, the more you benefit from a stronger judge. The more specific and binary the criterion, the more a smaller model can handle it.

Binary classifications ("Does this response contain personal information? Yes or no.") can often be handled by GPT-4o-mini or Claude Haiku with minimal quality loss. Nuanced, multi-dimensional assessments ("Rate the pedagogical effectiveness of this explanation on a 1-5 scale") benefit substantially from a frontier model.

A practical approach is to run a calibration study on your specific task with both a strong and a weak judge. If the correlation between their scores is above 0.9, use the cheaper model. If it drops below 0.8, the quality loss is not worth the savings. A worked example: Raschka uses Llama 3 8B via Ollama to score instruction-tuned model outputs against test references on a 0 to 100 scale. A fine-tuned GPT-2 medium scores around 50 and a Llama 3 instruct baseline around 82.6. That is enough to rank systems, but only after anchoring against a stronger judge or a human reference.^Q
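The thresholds above can be written as a small decision helper. The 0.9 and 0.8 cut-offs come from the text. The handling of the range between them is an assumption, since the rule of thumb does not say what to do there, and `choose_judge` is a hypothetical name.

```python
def choose_judge(strong_weak_correlation, accept_above=0.9, reject_below=0.8):
    """Pick a judge tier from the correlation between strong and weak judges."""
    if strong_weak_correlation > accept_above:
        return "cheap"
    if strong_weak_correlation < reject_below:
        return "strong"
    # Assumption: the in-between range is treated as undecided, e.g. gather
    # more calibration examples or compare the judges per dimension.
    return "borderline"


for rho in (0.93, 0.85, 0.75):
    print(rho, "->", choose_judge(rho))
```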

Batch Processing

Both OpenAI and Anthropic offer batch APIs that provide 50% cost reductions in exchange for higher latency (results within 24 hours rather than seconds). For evaluation pipelines that run nightly or weekly, batch processing halves your costs with no functional downside.

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prepare_batch_evaluations(outputs, rubric):
    """Prepare a batch file for OpenAI's Batch API."""
    requests = []
    for i, output in enumerate(outputs):
        requests.append({
            "custom_id": f"eval-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{
                    "role": "system",
                    "content": f"Evaluate using this rubric:\n{rubric}\n"
                              "Return JSON: {'score': N, 'reasoning': '...'}"
                }, {
                    "role": "user",
                    "content": f"Output:\n{output}"
                }],
                "response_format": {"type": "json_object"},
                "temperature": 0.0,
            }
        })

    # Write JSONL file for batch submission
    with open("eval_batch.jsonl", "w") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")

    return "eval_batch.jsonl"

# Prepare the batch input file (batch pricing is 50% cheaper, results within 24 hours)
batch_file = prepare_batch_evaluations(
    outputs=["Response 1...", "Response 2...", "..."],
    rubric="Rate factual accuracy 1-5. 5=all claims correct...",
)

# Upload and submit via OpenAI Batch API
with open(batch_file, "rb") as f:
    file_obj = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=file_obj.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch submitted: {batch.id}")
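Once the batch completes, the output file contains one JSON line per request. The sketch below parses such a file into judgments keyed by `custom_id`, which is safer than relying on line order. It assumes the output-line shape described in the Batch API documentation (`custom_id`, `response.status_code`, `response.body`, `error`); `parse_batch_output` is a hypothetical helper and the sample lines are synthetic.

```python
import json


def parse_batch_output(jsonl_text):
    """Map each custom_id to its parsed judgment, or None if the request failed."""
    judgments = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            judgments[record["custom_id"]] = None
            continue
        content = response["body"]["choices"][0]["message"]["content"]
        try:
            judgments[record["custom_id"]] = json.loads(content)
        except json.JSONDecodeError:
            # The judge returned something that is not valid JSON
            judgments[record["custom_id"]] = None
    return judgments


# Synthetic output lines for illustration; real files come from the API
sample = "\n".join([
    json.dumps({"custom_id": "eval-0", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {
            "content": json.dumps({"score": 4, "reasoning": "Mostly correct."})
        }}]},
    }}),
    json.dumps({"custom_id": "eval-1", "error": {"code": "server_error"},
                "response": None}),
])
parsed = parse_batch_output(sample)
print(parsed)
```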

Building an Evaluation Pipeline

Individual evaluations produce individual scores. A pipeline connects them into a continuous quality signal: generate test cases, run the system, judge the outputs, aggregate scores, track trends, and alert on regressions. The boundary between "model" and "judge" in such a pipeline is not architectural; it is a function of which prompt the model is currently seeing. The same model that generates outputs can evaluate them, and recent preference-tuning methods like DPO blur the line further by collapsing reward modeling into the policy update itself.^R

import json
from datetime import datetime
from pathlib import Path
from statistics import mean
from openai import OpenAI

client = OpenAI()


class EvaluationPipeline:
    """End-to-end LLM evaluation pipeline."""

    def __init__(self, system_under_test, test_cases, dimensions):
        self.system = system_under_test
        self.test_cases = test_cases
        self.dimensions = dimensions
        self.results_dir = Path("eval_results")
        self.results_dir.mkdir(exist_ok=True)

    def generate_outputs(self):
        """Run the system on all test cases."""
        outputs = []
        for case in self.test_cases:
            output = self.system(case["input"])
            outputs.append({
                "id": case["id"],
                "input": case["input"],
                "output": output,
                "reference": case.get("reference"),
            })
        return outputs

    def judge_output(self, output_record):
        """Evaluate a single output on all dimensions."""
        dim_text = "\n".join(
            f"- {name}: {desc}"
            for name, desc in self.dimensions.items()
        )

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": (
                    "Evaluate the output on each dimension (1-5). "
                    "Reason step by step for each.\n\n"
                    f"DIMENSIONS:\n{dim_text}\n\n"
                    "Return JSON: {'scores': {'dim': N, ...}, "
                    "'reasoning': {'dim': '...', ...}}"
                )
            }, {
                "role": "user",
                "content": (
                    f"Input: {output_record['input']}\n\n"
                    f"Output: {output_record['output']}"
                )
            }],
            response_format={"type": "json_object"},
            temperature=0.0,
        )
        return json.loads(response.choices[0].message.content)

    def run(self, run_label=None):
        """Execute the full pipeline: generate, judge, aggregate."""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

        # Step 1: Generate outputs
        outputs = self.generate_outputs()

        # Step 2: Judge each output
        evaluations = []
        for output_record in outputs:
            judgment = self.judge_output(output_record)
            evaluations.append({
                "id": output_record["id"],
                "input": output_record["input"],
                "output": output_record["output"],
                "scores": judgment["scores"],
                "reasoning": judgment["reasoning"],
            })

        # Step 3: Aggregate scores
        aggregate = {}
        for dim in self.dimensions:
            scores = [e["scores"][dim] for e in evaluations
                      if dim in e["scores"]]
            aggregate[dim] = round(mean(scores), 3) if scores else None

        # Step 4: Save results
        report = {
            "timestamp": timestamp,
            "label": run_label,
            "aggregate": aggregate,
            "evaluations": evaluations,
            "n_examples": len(evaluations),
        }
        path = self.results_dir / f"eval_{timestamp}.json"
        with open(path, "w") as f:
            json.dump(report, f, indent=2)

        return report

The pipeline structure is deliberately simple. Complexity should live in the rubrics and dimensions, not in the orchestration code. Each run produces a timestamped JSON file that captures everything: the inputs, outputs, per-example scores, reasoning, and aggregates. This makes it trivial to debug specific failures, track trends, and reproduce past evaluations.

Tracking Trends and Catching Regressions

A single evaluation snapshot tells you where you stand, but a series of snapshots tells you where the system is headed, which is the harder question for any team planning a release.

import json
from pathlib import Path

def load_history(results_dir):
    """Load all evaluation runs for trend analysis."""
    history = []
    for path in sorted(Path(results_dir).glob("eval_*.json")):
        with open(path) as f:
            run = json.load(f)
        history.append({
            "timestamp": run["timestamp"],
            "label": run.get("label"),
            **run["aggregate"],
        })
    return history

def check_regression(history, dimension, threshold=0.05):
    """Alert if a dimension has dropped below its best earlier run."""
    if len(history) < 2:
        return None

    # Aggregates can be None (no scores) or missing for older runs
    recent = history[-1].get(dimension)
    earlier = [h[dimension] for h in history[:-1]
               if h.get(dimension) is not None]
    if recent is None or not earlier:
        return None

    previous_best = max(earlier)

    if previous_best - recent > threshold:
        return {
            "dimension": dimension,
            "current": recent,
            "previous_best": previous_best,
            "drop": round(previous_best - recent, 4),
        }
    return None

# Example: check all dimensions for regressions
history = load_history("eval_results")
for dim in ["accuracy", "helpfulness", "safety", "relevance"]:
    alert = check_regression(history, dim)
    if alert:
        print(
            f"REGRESSION: {alert['dimension']} dropped "
            f"from {alert['previous_best']} to "
            f"{alert['current']} ({alert['drop']} drop)"
        )

Integrate this into your CI/CD pipeline. Every pull request that touches your model configuration, prompt templates, or retrieval logic should trigger an evaluation run against your golden test set. If any dimension regresses beyond a threshold, fail the build. This is how you prevent the slow, invisible quality erosion that plagues systems without continuous evaluation.
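A CI gate of this kind can be reduced to an exit code. The following is a self-contained sketch, assuming history records shaped like the output of `load_history`; `regressed_dimensions`, `ci_exit_code`, and the sample history are hypothetical, and the threshold is illustrative.

```python
import sys


def regressed_dimensions(history, dimensions, threshold=0.05):
    """Return (dimension, previous_best, current) for each regression."""
    if len(history) < 2:
        return []
    failures = []
    for dim in dimensions:
        current = history[-1].get(dim)
        earlier = [h[dim] for h in history[:-1] if h.get(dim) is not None]
        if current is None or not earlier:
            continue
        best = max(earlier)
        if best - current > threshold:
            failures.append((dim, best, current))
    return failures


def ci_exit_code(history, dimensions, threshold=0.05):
    """Exit code for a CI step: 0 passes the build, 1 fails it."""
    failures = regressed_dimensions(history, dimensions, threshold)
    for dim, best, current in failures:
        print(f"REGRESSION: {dim} {best} -> {current}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical history; in CI this would come from load_history("eval_results")
history = [
    {"accuracy": 4.2, "safety": 4.8},
    {"accuracy": 4.3, "safety": 4.8},
    {"accuracy": 3.9, "safety": 4.8},
]
exit_code = ci_exit_code(history, ["accuracy", "safety"])
# A CI entry point would end with: sys.exit(exit_code)
```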

When LLM Judges Fail

LLM-as-judge is powerful, but it is not universally applicable. There are domains and dimensions where it breaks down, and using it anyway produces a dangerous illusion of rigor. The general principle the production community has converged on is to combine benchmarks, LLM-as-judge, and human evaluation as complementary signals rather than treat them as substitutes for one another.^S

High-Stakes Factual Domains

In medical, legal, and financial applications, the cost of a single wrong evaluation can be enormous. An LLM judge that confidently scores an incorrect medical recommendation as "accurate" has not just failed at evaluation; it has validated a dangerous output that might reach a patient. Clinical trials of LLM-based therapy have required scrupulous human review of every interaction, with safety outreach in cases of self-harm risk and corrections in cases of out-of-scope advice, a level of oversight that would be difficult to scale outside a research setting.^T

The problem is not that LLM judges are bad at factual assessment. They are reasonably good at it. The problem is that "reasonably good" is insufficient when individual errors carry high consequences. In high-stakes domains, LLM judges serve as a first-pass filter, not a final arbiter. Use them to catch obvious failures and flag outputs for human review. Never use them as the sole quality gate for outputs that could cause harm.

Reference-Free Domain-Specific Pipelines

One way to push LLM-as-judge into domains where gold-standard references are scarce is to decompose evaluation into reference-free sub-dimensions. RAGAS does this for retrieval-augmented generation, scoring outputs on faithfulness to retrieved context, answer relevance, and context relevance without requiring any human-curated answer.⁶ The pattern transfers: pick a small number of judgeable sub-properties whose conjunction approximates quality, and have the judge evaluate each independently.
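The conjunction idea can be sketched as a gate over independently judged sub-scores. This is not the RAGAS library API: `passes_reference_free` is a hypothetical helper, the sub-scores are assumed to lie in [0, 1] and to come from separate judge calls, and the minimums are illustrative.

```python
def passes_reference_free(sub_scores, minimums):
    """Pass only if every sub-property clears its own minimum score."""
    failing = [
        name for name, floor in minimums.items()
        if sub_scores.get(name, 0.0) < floor  # a missing score counts as failing
    ]
    return {"passed": not failing, "failing": failing}


# Illustrative minimums, assuming each sub-score is in [0, 1]
minimums = {
    "faithfulness": 0.8,
    "answer_relevance": 0.7,
    "context_relevance": 0.6,
}
verdict = passes_reference_free(
    {"faithfulness": 0.95, "answer_relevance": 0.65, "context_relevance": 0.9},
    minimums,
)
print(verdict)
```

Gating on each sub-score separately, rather than averaging them, keeps a strong score on one property from masking a failure on another.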

Novel Reasoning

LLM judges can assess whether an output follows known patterns. They struggle to evaluate genuinely novel reasoning, unusual approaches, or creative solutions that diverge from conventional answers. A mathematical proof that uses an unconventional technique might be scored lower by the judge simply because it does not match the expected pattern, even if the proof is correct.

This limitation is inherent to the approach. The judge is, by definition, pattern-matching against its training data. Outputs that fall outside those patterns receive unreliable evaluations. For tasks that require evaluating creativity or novel problem-solving, human judgment remains essential. There is also a context-window effect worth flagging: when rubrics or evaluation contexts are long, language models show a U-shaped performance curve, using information in the middle of the prompt less reliably than information at the beginning or end, and a judge inherits that weakness.⁷

Cultural and Subjective Dimensions

Quality judgments that depend on cultural context, audience expectations, or subjective preferences are poorly served by LLM judges. What constitutes "appropriate tone" for a customer support response varies by culture, company, and individual customer. An LLM judge trained primarily on English-language data may apply Western communication norms to evaluate responses intended for a very different audience.

Subjective dimensions like "creativity," "engagement," or "humor" are similarly problematic. The judge will produce scores, but those scores reflect the aggregate preferences in its training data, not the specific preferences of your target audience.

The Hybrid Approach

The answer in all of these cases is the same: use LLM judges for coverage, and use human evaluation for calibration and edge cases.

LLM judges evaluate every output, every day, at scale. They catch the 80% of failures that are obvious: hallucinations, off-topic responses, safety violations, gross factual errors. Human evaluators review a sample of outputs weekly, focusing on the cases that the LLM judge found ambiguous, the cases from high-stakes categories, and a random sample to check for systematic blind spots.
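The weekly human-review sample described above can be assembled with a simple selection pass. This is a sketch under stated assumptions: treating a mid-scale score of 3 as "ambiguous", the `category` field, the 5% audit rate, and `select_for_human_review` itself are all hypothetical choices.

```python
import random


def select_for_human_review(evaluations, high_stakes, ambiguous_scores=(3,),
                            random_rate=0.05, seed=0):
    """Pick outputs for human review and record why each was picked."""
    rng = random.Random(seed)
    selected = {}
    for ev in evaluations:
        if ev["score"] in ambiguous_scores:
            selected[ev["id"]] = "ambiguous"
        elif ev["category"] in high_stakes:
            selected[ev["id"]] = "high_stakes"
        elif rng.random() < random_rate:
            # Random audit to surface systematic blind spots
            selected[ev["id"]] = "random_audit"
    return selected


# Hypothetical judged outputs, for illustration only
evaluations = [
    {"id": "a", "score": 3, "category": "general"},
    {"id": "b", "score": 5, "category": "medical"},
    {"id": "c", "score": 5, "category": "general"},
]
picked = select_for_human_review(evaluations, high_stakes={"medical"},
                                 random_rate=0.0)
print(picked)
```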

The LLM judge tells you what the system is doing. The human evaluators tell you whether the judge is right about it. Neither alone is sufficient. Together, they provide evaluation coverage that would be impossible to achieve with either approach in isolation. The broader trajectory matters here: as LLMs make coherent text generation trivial, the nature of evaluation itself is shifting, and automated evaluation is becoming necessary rather than merely useful.^U

. . .

References

Textbook grounding, chapter-level citations, and further reading for each numbered reference in this article live on the companion sources page.

1. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023.
2. Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." EMNLP 2023.
3. Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G., & Arawjo, I. (2024). "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." arXiv:2404.12272.
4. Dubois, Y., Galambosi, B., Liang, P., & Hashimoto, T. B. (2024). "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators." ICML 2024.
5. Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). "AlpacaEval: An Automatic Evaluator of Instruction-Following Models." GitHub repository.
6. Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217.
7. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.