Sources

Grounding, citations, and further reading for LLM-as-Judge: Using Models to Evaluate Models.

All of this is optional. The article itself is the tutorial. This page exists for readers who want to follow the citation trail back to the primary sources and read deeper into the survey literature.

Nothing on this page is required reading, and you do not need to purchase any of these books. Numbered references in the article hyperlink to the corresponding entries here.

About the Sources

Zheng et al.: Judging LLM-as-a-Judge (anchor paper)

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685.

The reference paper that established the modern LLM-as-judge paradigm. Introduces MT-Bench and Chatbot Arena, documents the agreement rate between GPT-4 and human expert judges (over 80%), and catalogs the systematic biases (position, verbosity, self-enhancement) that frame every conversation about judge reliability. Available at arxiv.org/abs/2306.05685.

Liu et al.: G-Eval

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP 2023. arXiv:2303.16634.

The G-Eval paper. Demonstrates that having the judge generate chain-of-thought reasoning before assigning a score raises correlation with human judgments above the prior automatic metrics tested in the paper, including supervised metrics trained specifically for the evaluation task. Available at arxiv.org/abs/2303.16634.

Dubois et al.: Length-Controlled AlpacaEval

Dubois, Y., Galambosi, B., Liang, P., & Hashimoto, T. B. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. COLM 2024. arXiv:2404.04475.

Documents verbosity bias quantitatively and proposes a length-controlled correction. Foundational for any pairwise evaluation pipeline because it shows how the length confound silently inflates win rates for the more verbose model. Available at arxiv.org/abs/2404.04475.

Shankar et al.: Who Validates the Validators?

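Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G., & Arawjo, I. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. arXiv:2404.12272.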

Names the alignment gap between automated and human evaluation and proposes methods for systematic validation of automated judges. Useful as the methodological grounding for the calibration section. Available at arxiv.org/abs/2404.12272.

Li et al.: AlpacaEval

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). AlpacaEval: An Automatic Evaluator of Instruction-following Models. GitHub repository.

The earlier AlpacaEval benchmark that paired LLM-as-judge with a fixed reference model. The combination of fixed-reference plus pairwise judgment became the template most production evaluation pipelines now follow. Available at github.com/tatsu-lab/alpaca_eval.

Es et al.: RAGAS

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.

Reference-free LLM-as-judge framework specifically for retrieval-augmented generation. Decomposes RAG evaluation into faithfulness, answer relevance, and context relevance, each scored by a judge model with no human gold standard required. Available at arxiv.org/abs/2309.15217.

Liu et al.: Lost in the Middle

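Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.

Documents the U-shaped performance curve in long-context models: accuracy is highest when the relevant information sits at the start or end of the context and lowest when it sits in the middle. Relevant to LLM-as-judge because long rubrics or long evaluation contexts can produce position-dependent judge behavior. Available at arxiv.org/abs/2307.03172.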

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 9 is the canonical formal treatment of preference learning and reward modeling, and most of the textbook notes on this page cite specific equations and sections from that chapter.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Mathematically grounded survey of LLM architecture and behavior. Strongest on instruction-following, sycophancy, and the historical evaluation traditions (Cranfield, TREC) that LLM-as-judge inherits. Cited several times below.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey. Useful for the pragmatic discussions of DPO preference tuning and the recommendation that benchmarks, LLM-as-judge, and human evaluation be combined as complementary signals rather than treated as substitutes.

The Evaluation Bottleneck

[A] The Cranfield-to-BLEU genealogy of automatic metrics

Widdows and Cohen trace quantitative evaluation in NLP back to Cleverdon's Cranfield experiments in the 1960s, which established precision and recall (later combined into the F-measure) as the standard metrics for information retrieval. The TREC conferences formalized shared-task evaluation with leaderboards from 1992 onward. BLEU, ROUGE, and F1 are direct descendants of that tradition, and their breakdown on fluent LLM outputs echoes a longstanding observation: getting good results on one challenge does not always mean a system will adapt reliably to new tasks.

Widdows & Cohen, Ch. 2.3.3 ("Search Evaluation").

[B] Why intrinsic metrics fail across tokenizers

Jurafsky and Martin describe the same evaluation breakdown for intrinsic metrics in SLP3 §7.6.1. Perplexity, the standard intrinsic metric for language models, depends on the number of tokens in a text, which makes it "very sensitive to differences in the tokenization algorithm." Perplexities from two language models with very different tokenizers are not directly comparable. This is a concrete example of the proxy breakdown that motivated the move to LLM-as-judge: even supposedly objective metrics fail when the systems being compared differ in fundamental architecture choices.

SLP3 §7.6.1.
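
The token-count dependence is easy to see numerically. The sketch below is illustrative only: the log-probability and token counts are made-up values, and `perplexity` is a hypothetical helper rather than anything taken from SLP3. It assumes two models assign the same total log-probability to the same text but split that text into different numbers of tokens.

```python
import math

def perplexity(total_log_prob: float, num_tokens: int) -> float:
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-total_log_prob / num_tokens)

# Hypothetical numbers: both models assign the same total natural-log
# probability to one sentence, but tokenize it into different lengths.
total_log_prob = -60.0
ppl_coarse = perplexity(total_log_prob, num_tokens=12)  # fewer, larger tokens
ppl_fine = perplexity(total_log_prob, num_tokens=20)    # more, smaller tokens

print(f"12 tokens: perplexity {ppl_coarse:.1f}")
print(f"20 tokens: perplexity {ppl_fine:.1f}")
```

The finer tokenizer reports a much lower perplexity even though both models assigned the sentence exactly the same probability, which is why the two numbers cannot be compared directly.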

Why It Works

[1] The over-80% GPT-4 / human agreement number

Zheng et al. report that GPT-4 achieves over 80% agreement with human expert judges on open-ended quality assessments on MT-Bench and Chatbot Arena, a rate comparable to the agreement between different human annotators. The paper is the empirical foundation for the practical claim made in this article: LLM judges are not perfect, but they are consistent and directionally correct, which is what evaluation pipelines need.

Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685

[C] Why preference-trained models can serve as judges at all

Jurafsky and Martin formalize the mechanism behind LLM-as-judge in SLP3 §9.2 and §9.3. The reward model trained on human preferences (via Bradley-Terry cross-entropy loss, Eq. 9.3) learns a scalar function r(x,o) that scores prompt/output pairs. RLHF then uses this reward model to align the LLM, while DPO (§9.3.2) optimizes the same preference objective without an explicit reward model. When an aligned model serves as a judge, it is effectively deploying the preference signal that was baked into it during preference-based training. The judge's ability to evaluate quality is not incidental; it is a direct consequence of the alignment procedure.

SLP3 §9.2, §9.3.2, Eq. 9.3.

[D] How little training data the preference signal needs

Widdows and Cohen describe the instruction-following transformation in Ch. 5.2.3. A base LLaMA model trained on 4 TB of text was converted into an instruction-follower using only ~40 MB of prompt-response pairs from GPT-4, which had itself been trained to produce responses preferred by human raters. The authors call it "remarkable how little additional training data are required," which reinforces the structural claim that the preference signal baked into instruction-tuned models is what makes LLM-as-judge viable at all.

Widdows & Cohen, §5.2.3.

Rubric Design

[E] Rubric design borrows from crowdworker annotation guidelines

Jurafsky and Martin demonstrate good rubric design implicitly in SLP3 §9.1.1 (Fig. 9.5). The crowdworker annotation guideline reproduced there specifies exact answer types ("span," "date," "number"), formatting constraints, and explicit boundary handling for partial dates ("if full date is not available in the passage you can write partial date such as 1992 or Jan 1992"). These properties (specificity, anchoring, behavioral descriptions, and exhaustive edge-case coverage) are the same properties that produce reliable LLM judge evaluations. The parallel is not accidental.

SLP3 §9.1.1, Fig. 9.5.

Scoring Strategies

[F] The Bradley-Terry foundation of pairwise comparison

Jurafsky and Martin ground pairwise comparison in the Bradley-Terry model (SLP3 §9.2.2): the probability that output o_i is preferred over o_j is the logistic sigmoid of their latent score difference. Annotators never need to assign absolute scores; they only need to express binary preferences. The model derives cardinal scores from ordinal judgments. This is the mathematical basis for the empirical observation that pairwise LLM-as-judge evaluations are more stable than pointwise scoring: the judge only needs to determine relative ordering, which it can do more reliably than assigning a number on an anchored scale.

SLP3 §9.2.2.
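
As a minimal sketch of the Bradley-Terry preference probability described above: the function name and the example scores are illustrative assumptions, not SLP3's notation or code.

```python
import math

def preference_probability(score_i: float, score_j: float) -> float:
    """P(o_i preferred over o_j): logistic sigmoid of the latent score gap."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

# Hypothetical latent scores for two outputs.
p = preference_probability(1.2, 0.4)

# Adding the same constant to both scores leaves the probability unchanged,
# because only the difference between the scores enters the model.
p_shifted = preference_probability(101.2, 100.4)

print(f"P(i over j) = {p:.3f}, after shifting both scores = {p_shifted:.3f}")
```

Because only score differences matter, the absolute level of the latent scores is arbitrary. That is one way to see why a pairwise judge never has to commit to a number on an anchored scale.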

[5] AlpacaEval as the production pairwise template

The Li et al. AlpacaEval benchmark paired LLM-as-judge with a fixed reference model so that a system under test could be evaluated against the same baseline every time. The fixed-reference plus pairwise judgment pattern became the production template most LLM teams now use for A/B testing and release gating. AlpacaEval predates the Length-Controlled correction (Dubois et al., 2024), which addresses the verbosity confound discussed in the bias section.

Li et al. (2023), AlpacaEval. github.com/tatsu-lab/alpaca_eval

G-Eval and Chain-of-Thought Judging

[2] G-Eval and the chain-of-thought judging upgrade

The Liu et al. G-Eval paper introduces a single modification to the LLM-as-judge prompt: the judge generates a chain-of-thought reasoning trace about how the rubric applies before assigning a score. In the paper's experiments, GPT-4 + G-Eval achieves higher correlation with human judgments than any prior automatic evaluation method, including supervised metrics trained specifically for the evaluation task. The mechanism is the same one that powers chain-of-thought reasoning in general: forcing the model to engage with the specific content of the output rather than producing a gut-reaction score.

Liu et al. (2023), G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634
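
The reasoning-before-score ordering can be sketched as a prompt template plus a parser. This is a hedged illustration, not the G-Eval authors' prompt: the template wording, the 1-to-5 scale, the "Score: N" output convention, and the helper names `build_judge_prompt` and `parse_score` are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical template: the judge must reason first, then emit the score.
PROMPT_TEMPLATE = """You are evaluating a response against a rubric.

Rubric:
{rubric}

Response to evaluate:
{response}

First, explain step by step how the rubric applies to this response.
Then, on the final line, write "Score: <integer from 1 to 5>".
"""

def build_judge_prompt(rubric: str, response: str) -> str:
    """Fill the template so the reasoning request precedes the score request."""
    return PROMPT_TEMPLATE.format(rubric=rubric, response=response)

def parse_score(judge_output: str) -> Optional[int]:
    """Return the last 'Score: N' value (1-5) in the judge output, or None."""
    matches = re.findall(r"Score:\s*([1-5])\b", judge_output)
    return int(matches[-1]) if matches else None

example_output = "The summary covers both key points but omits the date.\nScore: 4"
print(parse_score(example_output))
```

Taking the last match rather than the first is a deliberate choice here: a judge may quote the rubric's score labels while reasoning, and only the final line is meant to be the verdict.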

[G] Chain-of-thought as "think before you speak"

Widdows and Cohen discuss chain-of-thought prompting in Ch. 5.2.4, describing it as automating "the process of breaking a problem into basic steps" and characterizing the techniques as "successful implementations of the proverbial advice 'think before you speak.'" They show concrete examples where zero-shot chain-of-thought prompting improved a LLaMA-3 model from estimating four to five palindromic primes to correctly identifying around fifteen to twenty, by forcing step-by-step decomposition. G-Eval exploits the same mechanism for evaluation.

Widdows & Cohen, §5.2.4.

Position Bias and Other Failure Modes

[1] Position bias quantified in Zheng et al.

Zheng et al. document position bias across multiple judge models. When the same pair of responses was presented in both orders, the judge's preference changed in a significant fraction of cases, with the first-position advantage ranging from a few percentage points (subtle) to over twenty percentage points (severe), depending on the model and the similarity of the two responses. The mitigation (run each comparison in both orders, then treat inconsistent verdicts as ties or discard them) is now standard in serious pairwise pipelines.

Zheng et al. (2023). arXiv:2306.05685
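
A minimal sketch of the both-orders mitigation follows. The `Judge` signature and the two stub judges are hypothetical stand-ins for a real model call; the sketch returns "tie" when the verdict flips with position, which the caller can then count as a tie or discard.

```python
from typing import Callable

# Hypothetical judge signature: given (prompt, first_response, second_response),
# return "first" or "second" for whichever displayed response it prefers.
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = judge(prompt, resp_a, resp_b)    # A is shown first
    backward = judge(prompt, resp_b, resp_a)   # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # the verdict changed with position, so it is inconsistent

# Stub judges for illustration only.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # maximally position-biased

def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

print(debiased_verdict(always_first, "q", "x", "y"))
print(debiased_verdict(prefers_longer, "q", "long answer", "short"))
```

The purely position-driven stub never produces a win under this scheme, while the content-driven stub keeps its verdict in both orders. Note that the content-driven stub here is itself length-biased, so the check removes position effects only.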

[H] Ordering biases as a structural property of language models

Widdows and Cohen independently document ordering biases in language models in Ch. 6.1.2. As early as 2005, models were found to encode strong ordering preferences like "salt and pepper" versus "pepper and salt," including hierarchical and gender-based orderings ("men and women" but "Ladies and gentlemen"). This suggests position bias in LLM judges may be a deep structural property of how language models encode sequence, not merely a superficial prompt artifact.

Widdows & Cohen, §6.1.2.

[4] Length-Controlled AlpacaEval and verbosity bias

Dubois et al. quantify verbosity bias on AlpacaEval and propose a length-controlled correction. Their analysis shows that a non-trivial share of "wins" in pairwise LLM-as-judge evaluation is attributable to length alone, not substance. This is the empirical grounding for the article's claim that verbosity bias is particularly dangerous because it incentivizes the wrong behavior (verbose outputs, not better ones) during model development.

Dubois et al. (2024), Length-Controlled AlpacaEval. arXiv:2404.04475
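
Before applying any correction, a pipeline can run a quick diagnostic for the length confound. The sketch below is not the regression-based correction from Dubois et al.; it is a simpler check, under the assumption that comparisons are stored as hypothetical (response_a, response_b, winner) tuples, of how often the longer response wins.

```python
def longer_win_rate(comparisons):
    """Fraction of comparisons won by the longer response.

    `comparisons` is a list of (response_a, response_b, winner) tuples,
    where winner is "A" or "B". Length is measured in whitespace-separated
    words, and equal-length pairs are skipped.
    """
    longer_wins = 0
    counted = 0
    for resp_a, resp_b, winner in comparisons:
        len_a, len_b = len(resp_a.split()), len(resp_b.split())
        if len_a == len_b:
            continue
        counted += 1
        longer = "A" if len_a > len_b else "B"
        if winner == longer:
            longer_wins += 1
    return longer_wins / counted if counted else float("nan")

# Made-up judge verdicts for illustration.
sample = [
    ("a b c", "a", "A"),   # longer response (A) won
    ("a", "a b", "B"),     # longer response (B) won
    ("a b", "a", "B"),     # shorter response (B) won
    ("a", "b", "A"),       # equal length, skipped
]
rate = longer_win_rate(sample)
print(f"longer response won {rate:.0%} of unequal-length comparisons")
```

A rate well above 50% is a warning sign rather than proof of bias, since longer answers are sometimes better on the merits. It indicates when a length-controlled analysis is worth running.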

[I] Why verbosity bias likely emerges from RLHF

Jurafsky and Martin provide the formal background for why verbosity bias likely emerges from RLHF. In SLP3 §9.3 (Eq. 9.5), the RLHF objective includes a KL divergence penalty to keep the model close to the reference policy. Without that constraint, the model would forget what it learned during pretraining as it pivots to seeking high rewards. The penalty limits drift, but it does not remove the incentive to exploit whatever the reward model happens to favor. Verbosity bias in judges likely reflects that reward-hacking dynamic: longer responses may have accumulated more positive preference signals during RLHF, encoding a length-quality association in the model's internalized reward function.

SLP3 §9.3, Eq. 9.5.
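
For reference, the KL-penalized objective is commonly written as below. This is the standard form from the RLHF literature; the symbols (policy \pi_\theta, reference policy \pi_{\mathrm{ref}}, penalty weight \beta, prompt distribution \mathcal{D}) are assumptions here and may differ from the notation of SLP3 Eq. 9.5.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; o \sim \pi_\theta(\cdot \mid x)}
\big[\, r(x, o) \,\big]
\;-\;
\beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
```

The first term rewards whatever r(x, o) scores highly, including any spurious preference for length; the second term only limits how far the policy can move from the reference in pursuit of that reward.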

[J] Sycophancy as a deeper explanation of self-enhancement bias

Widdows and Cohen discuss LLM sycophancy in Ch. 6.1.1, defined as the tendency of LLMs trained as assistants to agree with viewpoints presented to them. If an LLM judge exhibits sycophancy toward the outputs it evaluates, it would systematically inflate scores rather than critically assess quality. This gives a deeper explanation for self-enhancement bias: it may not be self-preference in the narrow sense, but the broader sycophantic tendency baked into assistant-trained models.

Widdows & Cohen, §6.1.1.

[K] Confident hallucination as a documented failure mode

Widdows and Cohen illustrate this failure mode vividly in Ch. 6.1.1. They show a Galactica model generating an authoritative-sounding scientific abstract claiming that ivermectin effectively treats COVID-19, complete with clinical language and confident assertions, all entirely fabricated. They note that "plausibility in and of itself can be persuasive and text that appears to come from an authoritative source could lead to misguided and even harmful medical decisions." An LLM judge evaluating such output faces the same trap: fluent, confident prose that is substantively wrong.

Widdows & Cohen, §6.1.1.

Multi-Dimensional Evaluation

[L] Multi-aspect Likert scoring as the industry standard

Jurafsky and Martin describe the industry standard for multi-dimensional evaluation in SLP3 §9.2.1. Current preference datasets rate outputs on Likert scales across distinct aspects: helpfulness, honesty, correctness, complexity, and verbosity. This mirrors the multi-dimensional LLM-as-judge pattern exactly: instead of a single "which is better" signal, evaluators produce a vector of aspect scores. Annotators rating model outputs in isolation along independent aspects avoid the cost of extensive pairwise comparisons.

SLP3 §9.2.1.

Calibration: Anchoring Judge Scores

[3] Aligning automated evaluators with human preference

The Shankar et al. paper names the validation problem in its title: who validates the validators? The paper proposes a methodology for systematically aligning LLM-assisted evaluation of LLM outputs with human preferences, including the use of held-out human-graded examples for measuring judge-human agreement and iterating on rubrics until calibration metrics cross practitioner thresholds. This is the methodological backbone of the calibration section.

Shankar et al. (2024), Who Validates the Validators? arXiv:2404.12272
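
Measuring judge-human agreement on a held-out set can be sketched as follows. Raw agreement and Cohen's kappa are common choices for this, not metrics prescribed by Shankar et al.; the labels below are made up, and `agreement_and_kappa` is a hypothetical helper.

```python
from collections import Counter

def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability that two independent raters with these
    # label frequencies would pick the same label.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so this sketch returns 1.0 by convention.
        return observed, 1.0
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Hypothetical held-out set: human grades versus judge grades.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"raw agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```

Kappa discounts the agreement two raters would reach by chance, which makes it a more conservative signal than raw agreement when one label dominates the calibration set.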

[M] DPO and the model-as-evaluator continuum

Alammar and Grootendorst describe Direct Preference Optimization (DPO) as a method that uses accepted/rejected response pairs to align models. This is directly related to the model-as-evaluator concept: the same preference data that trains a model to produce better outputs can calibrate a judge model to distinguish better from worse responses. The boundary between "model" and "judge" is not architectural; it is a consequence of prompting strategy and which side of the pipeline the model sits on.

Alammar & Grootendorst, Ch. 12.

[N] Data contamination as a hidden calibration risk

Jurafsky and Martin describe a striking example of the calibration problem in SLP3 §7.6.2: data contamination. Since models train on web data and benchmarks like MMLU (15,908 questions in 57 areas) are publicly available, models may incorporate some MMLU questions into their training. If those questions are then used for evaluation, the metric overstates the performance. The same risk applies to LLM-as-judge calibration: if the judge was trained on data that overlaps with the calibration set, agreement metrics will be inflated. Fresh, held-out calibration data is essential.

SLP3 §7.6.2.

[O] The "always a rubric problem" claim has limits

The article's claim that low judge-human agreement is "always a rubric problem, not a model problem" may be too strong. Widdows and Cohen describe a maker's bias in Ch. 1.4: as machine learning engineers, we want our models to be valid and valuable, which can make us eager to believe that the world is more like the simple situation rather than the muddle. They note this manifests as "a bias in favor of more biased models." In calibration, the same maker's bias could lead a team to blame the rubric when the underlying model genuinely lacks the capacity to evaluate certain dimensions.

Widdows & Cohen, §1.4.

The Economics

[P] Where the cost actually lives

Jurafsky and Martin give helpful cost context in SLP3 §9.1. Instruction tuning is "much more modest than the training of base LLMs," typically involving several epochs over instruction datasets numbering in the thousands, and the overall cost is "a small fraction of the original cost to train the base model." This reframes the economics of LLM-as-judge: the alignment training that makes models capable judges is relatively cheap; the real expense is running judges at inference time across thousands of evaluations.

SLP3 §9.1.

[Q] A worked example of cheap-judge calibration

Raschka uses Llama 3 (8B) via Ollama to score instruction-tuned model responses on a 0 to 100 scale against test references. His fine-tuned GPT-2 medium averaged around 50; the Llama 3 instruct baseline scored around 82.6. This is a concrete demonstration of both the promise and the calibration challenge of cheap-judge evaluation: the numbers are usable as a ranking signal, but they need anchoring against human or stronger-judge scores before they can be interpreted as absolute quality.

Raschka, Build a Large Language Model (From Scratch) (Manning, 2024), Ch. 7.

Building an Evaluation Pipeline

[R] DPO blurs the line between generator and evaluator

Jurafsky and Martin describe DPO in SLP3 §9.3.2 as an approach that eliminates the explicit reward model entirely, using a closed-form solution for the reward function in terms of the optimal policy. This is relevant to evaluation pipeline design: if a model can internalize preference judgments without a separate reward model, the same model can serve as both generator and evaluator in a pipeline. The boundary between "the model" and "the judge" is not architectural; it is a consequence of prompting strategy.

SLP3 §9.3.2.
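
The closed-form relationship mentioned above is usually written as follows. This is the standard result from the DPO literature; the notation (optimal policy \pi^{*}, reference policy \pi_{\mathrm{ref}}, penalty weight \beta, partition function Z(x), preferred output o_w, rejected output o_l) is an assumption and may differ from SLP3's.

```latex
r(x, o) \;=\; \beta \log \frac{\pi^{*}(o \mid x)}{\pi_{\mathrm{ref}}(o \mid x)} \;+\; \beta \log Z(x)

P(o_w \succ o_l \mid x) \;=\;
\sigma\!\left(
  \beta \log \frac{\pi^{*}(o_w \mid x)}{\pi_{\mathrm{ref}}(o_w \mid x)}
  \;-\;
  \beta \log \frac{\pi^{*}(o_l \mid x)}{\pi_{\mathrm{ref}}(o_l \mid x)}
\right)
```

The Z(x) term cancels in the Bradley-Terry difference, so the preference probability depends only on policy log-ratios. That is the sense in which the policy itself carries the reward signal.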

When LLM Judges Fail

[S] Combining benchmarks, judges, and human evaluation

Alammar and Grootendorst stress that evaluation remains challenging and that no single metric works for all use cases. They recommend combining benchmarks, LLM-as-judge, and human evaluation as complementary signals. Over-reliance on any one approach creates blind spots, which is why the article's failure-mode section is paired with a hybrid-approach recommendation.

Alammar & Grootendorst, Ch. 12.

[T] Clinical-trial-level oversight in high-stakes domains

Widdows and Cohen reinforce the high-stakes point in Ch. 6.2 with a concrete clinical example. LLM-based therapy in a research trial required investigators to "scrupulously review all interactions," providing safety outreach in fifteen incidents involving risk of self-harm and corrections in thirteen cases of out-of-scope medical advice. They note this level of oversight "would likely be difficult to scale outside the context of a clinical trial." This illustrates why LLM judges cannot be the sole arbiter in high-stakes domains: human oversight remains essential even when the model performs well on average.

Widdows & Cohen, §6.2.

[6] RAGAS and reference-free RAG evaluation

Es et al. define RAGAS, a reference-free framework that uses LLM-as-judge to score retrieval-augmented generation along three dimensions: faithfulness to the retrieved context, answer relevance to the question, and context relevance to the question. It is the canonical example of a domain-specific LLM-as-judge pipeline that bypasses the gold-reference bottleneck. It is relevant to the "novel reasoning" failure mode discussion because reference-free evaluation pushes more weight onto the rubric and the judge's reasoning.

Es et al. (2023), RAGAS. arXiv:2309.15217
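
The faithfulness dimension can be sketched as the fraction of answer statements that the retrieved context supports. This is a simplified illustration, not the RAGAS library's API: `faithfulness_score` and `substring_verifier` are hypothetical names, and the substring check stands in for the judge-model call that a real pipeline would use to decide whether the context supports each statement.

```python
from typing import Callable, List

def faithfulness_score(
    statements: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer statements that the context supports."""
    if not statements:
        raise ValueError("need at least one statement to score")
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)

# Stand-in verifier for illustration only: a real pipeline would ask a judge
# model whether the context entails the statement.
def substring_verifier(statement: str, context: str) -> bool:
    return statement.lower() in context.lower()

context = "The Eiffel Tower is in Paris. It was completed in 1889."
statements = ["The Eiffel Tower is in Paris", "It was completed in 1925"]
score = faithfulness_score(statements, context, substring_verifier)
print(f"faithfulness = {score:.2f}")
```

No gold answer appears anywhere in the computation. The score depends entirely on how the answer is split into statements and on how the verifier judges support, which is where the rubric and the judge's reasoning carry the weight.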

[7] Long-context attention and judge reliability

Liu et al. document the "Lost in the Middle" phenomenon: language models exhibit a U-shaped performance curve, where information at the beginning and end of long contexts is used more reliably than information in the middle. This is relevant to LLM-as-judge because long rubrics, long evaluation contexts, and long output windows can produce position-dependent judge behavior independent of the well-studied A-vs-B position bias in pairwise comparison.

Liu et al. (2023), Lost in the Middle. arXiv:2307.03172

[U] Why automated evaluation is becoming necessary, not merely useful

Widdows and Cohen offer a broader historical perspective in Ch. 6.3. They observe that as tools reduce the human effort needed to complete different tasks, how we evaluate one another will change, in practical and formal ways. Written examinations were designed when coherent written text was scarce and hard to produce, an assumption that is now obsolete. This suggests LLM-as-judge is not merely a technical convenience but part of a deeper shift: as LLMs make text generation trivial, the very nature of evaluation must evolve, and automated evaluation becomes necessary rather than optional.

Widdows & Cohen, §6.3.
