Sources
Grounding, citations, and further reading for Human Evaluation Frameworks for LLM Systems.
All of this is optional. The article itself is the tutorial. This page exists for readers who want to follow the citation trail back to the primary sources and read deeper into the annotation-methodology literature.
Nothing on this page is required reading, and you do not need to purchase any of these books. Numbered references in the article hyperlink to the corresponding entries here.
About the Sources
Krippendorff: Content Analysis (anchor reference)
The definitive reference on reliability metrics for content analysis, including the alpha coefficient that bears Krippendorff's name. The text walks the assumptions behind every standard agreement metric, explains why kappa fails on imbalanced label distributions, and grounds the design choices any serious human-evaluation pipeline has to make.
Snow et al.: Cheap and Fast, But Is It Good?
The empirical foundation for the modern crowd-annotation workflow. Demonstrated that aggregated crowdworker annotations can match expert quality on many NLP tasks when quality control is in place. Available at aclanthology.org/D08-1027.
Zheng et al.: Judging LLM-as-a-Judge
Introduces the LLM-as-judge paradigm and compares automated judge performance to human evaluation, establishing the benchmarks for judge-human agreement that frame the calibration discussion in this article. Available at arxiv.org/abs/2306.05685.
Shankar et al.: Who Validates the Validators?
Examines the alignment gap between automated and human evaluation and proposes a methodology for systematic validation of automated judges. The paper articulates why human evaluation cannot be fully replaced by automated judges, even sophisticated ones. Available at arxiv.org/abs/2404.12272.
Hovy & Lavid: corpus annotation methodology
Foundational paper on annotation methodology. Lays out the case that annotation is itself a science, emphasizing the importance of guidelines, training, and reliability measurement as load-bearing artifacts rather than overhead. Available at link.springer.com.
Artstein & Poesio: inter-coder agreement survey
Comprehensive survey of agreement metrics in computational linguistics, their assumptions, and appropriate use cases. The canonical reference when deciding which metric to compute for which annotation task. Available at doi.org/10.1162/coli.07-034-R2.
Cohen: the original kappa paper
The original paper introducing Cohen's kappa, the most widely used pairwise agreement metric. Worth reading once for the historical context, even though Krippendorff (above) is the more practical day-to-day reference.
SLP3: Jurafsky & Martin
The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 9 is the canonical formal treatment of preference data, reward modeling, and the Bradley-Terry mathematics behind pairwise ranking; Chapter 7 covers data contamination and benchmark methodology. Cited several times below.
Widdows & Cohen: Large Language Models: How They Work and Why They Matter
Mathematically grounded survey of LLM architecture and behavior. Strong on the historical evaluation traditions (Cranfield, TREC) that human-evaluation pipelines inherit, and on the safety concerns (hallucination, sycophancy, clinical-trial-grade oversight) that make human evaluation non-negotiable in high-stakes domains. Cited several times below.
Alammar & Grootendorst: Hands-On Large Language Models
Practitioner-oriented survey. Useful for the discussion of DPO preference tuning and for the recommendation that benchmarks, LLM-as-judge, and human evaluation be combined as complementary signals rather than treated as substitutes.
The Irreducible Role of Human Judgment
[A] Cranfield, TREC, and the Cleverdon tradition
Widdows and Cohen describe how information retrieval was guided by quantitative evaluation from the 1960s onward. Cleverdon's team at Cranfield University had human experts comb document collections to create relevance judgments as ground truth, which then became the standard for comparing automated retrieval systems. The pattern is the same one this article describes: humans define what "correct" looks like, and automated systems are measured against that human baseline.
Widdows & Cohen, §2.3.3 ("Search Evaluation").
[B] Why preference judgments anchor the entire post-training stack
Jurafsky and Martin formalize this dependency in SLP3 §9.2. Preference-based learning exists precisely because human judgments anchor the entire post-training pipeline. The textbook notes that "unlike instructions, preference judgments do not require people to know how to do something. We simply have to have an opinion about the end result." This is the formal basis for why human evaluation cannot be eliminated: the preference signal that trains and aligns LLMs originates from human annotators, and the quality of that signal determines model quality downstream.
SLP3 §9.2. Read SLP3
When Human Evaluation Is Necessary
[C] Hallucination, toxicity, and sycophancy as failure modes
Jurafsky and Martin catalog these safety risks formally in SLP3 §7.7. LLMs are "prone to saying things that are false, a problem called hallucination." Models can generate text that is "dangerous," "false," and "toxic," and pre-LLM systems like Siri and Alexa already gave medical advice that "if actually taken, would have led to harm or death." The chapter also discusses sycophancy, where models agree with users rather than correct them, a particularly dangerous failure mode in clinical settings. These are the failure modes that demand human evaluation in high-stakes domains.
SLP3 §7.7. Read SLP3
[D] Clinical-trial-grade oversight for high-stakes LLMs
Widdows and Cohen reinforce this point in Ch. 1 and Ch. 6. They draw a parallel to clinical trials: medicine has rigorous statistical processes for demonstrating that treatments work, but "machine learning products, like other web technologies, have never been expected to undergo anything like clinical trials." They argue that as LLMs enter clinical settings, properly-conducted trials "might become an important safeguard." In Ch. 6.1.1, they show how the Galactica model generated false medical claims about Ivermectin in the style of a scientific abstract, underscoring the shared-blind-spots risk that high-stakes evaluation has to defend against.
Widdows & Cohen, Ch. 1 and §6.1.1.
[E] Why benchmarks miss novel tasks
Widdows and Cohen make a related point about benchmark limitations in Ch. 2: "Getting good results at one challenge doesn't always mean a system will adapt reliably to new tasks." They describe how TREC-style shared evaluations helped standardize comparisons, observing that having common benchmarks "is much more reliable than having each research effort choose a preferred definition of 'best'." Even so, a shared benchmark only covers the tasks it was built for. A therapy-summary task is precisely the kind of novel application that falls outside existing benchmarks and demands fresh human definition.
Widdows & Cohen, Ch. 2.
[3] LLM-as-judge inherits its calibration from humans
Jurafsky and Martin describe the UltraFeedback dataset in SLP3 §9.2.1, where preference judgments were "generated by prompting outputs from a diverse set of LLMs and then prompting GPT-4 to rank the outputs for each prompt." This is LLM-as-judge used to generate training data, not just evaluation data. Even when LLMs serve as judges, the original calibration signal traces back to human preferences: GPT-4's ability to rank outputs was itself shaped by human preference data during its own RLHF training. See also Zheng et al. (2023) for the modern benchmarks of judge-human agreement.
SLP3 §9.2.1; Zheng et al. (2023). arXiv:2306.05685
[F] Sycophancy and subjective evaluation
Widdows and Cohen discuss LLM sycophancy in Ch. 6.1.1: the tendency of LLMs "trained as assistants to agree with viewpoints presented to them." This matters for subjective evaluation. An LLM judge evaluating "empathy" or "tone" may exhibit the same sycophantic tendencies, rating responses favorably because they sound agreeable rather than because they are genuinely empathetic. Human evaluation of subjective dimensions cannot be fully delegated to LLM judges for this reason.
Widdows & Cohen, §6.1.1.
The Annotation Pipeline
[G] The Cranfield precision/recall decomposition
Widdows and Cohen give a useful historical example of the decomposition principle in Ch. 2.3.3. The Cranfield experiments separated precision (the proportion of retrieved documents that are relevant) from recall (the proportion of relevant documents that are retrieved), noting that "errors come in different forms, and making some errors go away makes others more likely." The F-measure combines them, but the underlying appreciation of distinct failure modes remains. This is the same insight as decomposing "overall quality" into factual accuracy, tone, completeness, and so on.
Widdows & Cohen, §2.3.3.
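The decomposition can be made concrete with a few lines of Python. This is a minimal sketch, not code from Widdows and Cohen: the function name and the document IDs are hypothetical, and the F-measure shown is the balanced F1 variant.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and balanced F1 for one query.

    retrieved: document IDs the system returned (hypothetical IDs here).
    relevant:  document IDs human judges marked relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    # F1: harmonic mean of the two, defined as 0 when both are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data: 4 documents retrieved, 5 judged relevant, 2 in common.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d2", "d5", "d6", "d7"]
precision, recall, f1 = precision_recall_f1(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the two numbers separately before combining them is the point: a system can raise one while lowering the other, and the single F1 figure hides which happened.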
[5] Annotation methodology as a science
Hovy and Lavid lay out the case that annotation deserves to be treated as a methodology rather than as overhead. They emphasize three artifacts: explicit guidelines, structured annotator training, and reliability measurement as a precondition for treating annotated data as ground truth. This is the methodological backbone of every recommendation in the article's annotation-pipeline section.
Hovy & Lavid (2010). link.springer.com
Writing Annotation Guidelines
[H] Annotation guidelines double as instruction-tuning prompts
Jurafsky and Martin show in SLP3 §9.1.1 (Fig. 9.5) that crowdworker annotation guidelines can be directly repurposed as instruction-tuning prompts. The detailed guideline from the NaturalInstructions dataset specifies answer types ("span," "date," "number"), formatting rules, and edge-case handling. The same guideline structures that produce reliable annotations also produce reliable instruction-tuning data. The practitioner implication: investing in well-structured annotation guidelines pays dividends twice, once for evaluation quality and once for training data quality.
SLP3 §9.1.1, Fig. 9.5. Read SLP3
Likert Scales vs. Binary vs. Ranking
[I] Multi-aspect Likert as the production standard
Jurafsky and Martin describe the industry-standard use of Likert scales for LLM preference data in SLP3 §9.2.1. Annotators rate outputs on a Likert scale (0 to 4) along distinct dimensions: helpfulness, honesty, correctness, complexity, and verbosity. This multi-aspect Likert approach is used in datasets like HH-RLHF and LMSYS. The key design decision: rating dimensions independently avoids the calibration problem that single-score Likert suffers from. Aspect-level scores preserve diagnostic information that a composite score loses.
SLP3 §9.2.1. Read SLP3
[J] The contested boundary around "hallucinated fact"
Widdows and Cohen complicate the "hallucinated fact" binary in Ch. 6.1.1. They argue that the boundary between hallucination and acceptable generation is blurrier than it appears: "In a more traditional design, factuality and fluency were separate responsibilities." A language model generating "J.S. Bach was born in 1985" from a database record containing 1985 is working perfectly as a language model. The term "hallucination" is itself contested; cognitive scientist Christopher Summerfield suggests confabulation is more accurate. This matters for annotation guidelines: annotators need precise definitions of what counts as a hallucinated fact versus an acceptable generation.
Widdows & Cohen, §6.1.1.
[7] Bradley-Terry and pairwise preference
Jurafsky and Martin formalize pairwise ranking mathematically in SLP3 §9.2.2 using the Bradley-Terry model: the probability that output o_i is preferred over o_j is the logistic sigmoid of their score difference. This formulation turns pairwise human preferences into probabilistic scores suitable for gradient-based learning. The InstructGPT team had annotators rank sets of four sampled outputs, yielding six preference pairs per ranked list. The Bradley-Terry mathematics converts ordinal "which is better" judgments into the cardinal scores needed for reward-model training. Cohen (1960) is cited here for kappa, the two-annotator agreement statistic commonly used to check how consistently annotators make such judgments.
SLP3 §9.2.2; Cohen (1960). Read SLP3
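The two mechanics in this entry, the sigmoid of a score difference and the expansion of one ranked list of four into six pairs, can be sketched as follows. The function names and the scores are illustrative assumptions, not notation taken from SLP3.

```python
import math
from itertools import combinations


def preference_probability(score_i, score_j):
    """Bradley-Terry: P(o_i preferred over o_j) = sigmoid(score_i - score_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def pairs_from_ranking(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() preserves input order, so the first element of each
    pair is always the higher-ranked output.
    """
    return list(combinations(ranked_outputs, 2))


# One annotator ranking of four sampled outputs, best first (hypothetical IDs).
ranking = ["o1", "o2", "o3", "o4"]
pairs = pairs_from_ranking(ranking)
print(len(pairs), "pairs:", pairs)

# Equal scores give a coin flip; a larger gap pushes the probability toward 1.
print(preference_probability(0.0, 0.0))
print(preference_probability(2.0, 0.0))
```

A ranked list of k outputs yields k(k-1)/2 pairs, which is where the six pairs from four outputs come from.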
Inter-Annotator Agreement
[1] Krippendorff's alpha as the default IAA metric
Krippendorff's Content Analysis is the canonical reference for reliability metrics. The alpha coefficient handles any number of annotators, any measurement scale (nominal, ordinal, interval, ratio), and missing data, which is why it is the recommended default in this article. The text also explains why kappa-family metrics fail on imbalanced label distributions, which is a common pitfall in LLM-output annotation where one label often dominates.
Krippendorff (2018), Content Analysis, 4th ed.
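For readers who want to see how alpha copes with a variable number of annotators and with missing labels, here is a minimal sketch for the nominal case only, built on the coincidence-matrix formulation. The function name and the sample ratings are hypothetical; ordinal, interval, and ratio data need a different distance function that this sketch does not implement.

```python
import math
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per item, holding each annotator's label;
           None marks a missing annotation.
    """
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels carries no agreement information
        # Every ordered pair of labels within an item counts 1 / (m - 1).
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    totals = Counter()
    for (label, _other), weight in coincidences.items():
        totals[label] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        return math.nan  # only one label was ever used; alpha is undefined
    return 1.0 - observed / expected


# Three annotators, five items, some labels missing (illustrative data).
ratings = [
    ["ok", "ok", "ok"],
    ["ok", "bad", None],
    ["bad", "bad", "bad"],
    ["ok", None, None],
    ["bad", "bad", "ok"],
]
alpha = krippendorff_alpha_nominal(ratings)
print(f"alpha = {alpha:.3f}")
```

In practice a maintained library is the safer choice; the sketch is only meant to show that missing cells are skipped rather than imputed.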
[6] The IAA-metric assumptions you have to choose between
Artstein and Poesio's survey is the standard reference for choosing between pairwise and multi-annotator agreement metrics. They walk through the assumptions each metric makes about chance-agreement, the distinction between observed and expected agreement, and the cases where Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha give different verdicts on the same data. Worth reading once before publishing any IAA number.
Artstein & Poesio (2008), Computational Linguistics 34(4). doi.org/10.1162/coli.07-034-R2
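The distinction between observed and expected agreement is easiest to see on a label set where one label dominates. The following is a minimal sketch of Cohen's kappa for two annotators; the function name and the 100-item dataset are illustrative assumptions, not data from the survey.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (observed agreement, chance-expected agreement, kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently at their own marginal rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, expected, (observed - expected) / (1.0 - expected)


# 100 items where "ok" dominates: the annotators agree on 90 "ok" items
# and disagree on the 10 items where either of them said "bad".
annotator_a = ["ok"] * 95 + ["bad"] * 5
annotator_b = ["ok"] * 90 + ["bad"] * 5 + ["ok"] * 5
observed, expected, kappa = cohen_kappa(annotator_a, annotator_b)
print(f"observed={observed:.3f} expected={expected:.3f} kappa={kappa:.3f}")
```

Raw agreement here is 90 percent, yet chance alone would produce slightly more than that, so kappa comes out just below zero. This is why a bare percent-agreement figure should not be published on its own.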
Annotator Selection and Training
[K] Preference tuning closes the annotation/training loop
Alammar and Grootendorst describe the three-step LLM training process, where the final step is preference tuning using human-provided accepted/rejected response pairs. This is the direct inverse of the annotation task described in the article: the same human judgments that train better models also serve as the gold standard for evaluating them. Annotation quality determines both model quality and evaluation quality.
Alammar & Grootendorst, Ch. 12.
[2] When crowdworkers can substitute for experts
Snow et al. demonstrated, at the start of the modern crowd-annotation era, that aggregated crowdworker annotations can match expert quality on many NLP tasks when quality-control mechanisms (gold items, redundancy, careful filtering) are in place. This is the empirical license for the article's recommendation that experts write the guidelines and crowdworkers scale the application of those guidelines. The boundary cases (domain-specific tasks, expert-only judgment) are exactly where this license does not extend.
Snow et al. (2008). aclanthology.org/D08-1027
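The quality-control mechanisms named above (gold items, redundancy, filtering) can be combined in a few lines. This is a simplified sketch under stated assumptions, not the procedure from Snow et al.: the 0.7 accuracy threshold, the function names, and the data are all hypothetical, and aggregation is plain majority vote.

```python
from collections import Counter, defaultdict


def filter_workers_by_gold(annotations, gold, min_accuracy=0.7):
    """Keep workers whose accuracy on gold items meets the threshold.

    annotations: list of (worker, item, label) tuples.
    gold:        dict mapping gold item -> expert label.
    Workers who saw no gold items are excluded, since they cannot be vetted.
    """
    correct, seen = Counter(), Counter()
    for worker, item, label in annotations:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w for w in seen if correct[w] / seen[w] >= min_accuracy}


def majority_vote(annotations, trusted_workers):
    """Aggregate redundant labels per item using only trusted workers.

    Ties resolve to the tied label that was encountered first.
    """
    votes = defaultdict(Counter)
    for worker, item, label in annotations:
        if worker in trusted_workers:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


gold = {"g1": "pos", "g2": "neg"}
annotations = [
    ("w1", "g1", "pos"), ("w1", "g2", "neg"), ("w1", "x1", "pos"),
    ("w2", "g1", "pos"), ("w2", "g2", "neg"), ("w2", "x1", "pos"),
    ("w3", "g1", "neg"), ("w3", "g2", "pos"), ("w3", "x1", "neg"),  # fails both gold items
    ("w4", "g1", "pos"), ("w4", "g2", "neg"), ("w4", "x1", "neg"),
]
trusted = filter_workers_by_gold(annotations, gold)
labels = majority_vote(annotations, trusted)
print(sorted(trusted), labels["x1"])
```

The design choice worth noting is the order of operations: workers are vetted on gold items first, and only then are their labels on the real items aggregated.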
[L] Expert annotation propagates through every model trained on the data
Widdows and Cohen describe a concrete example of expert annotation in Ch. 5.2.3. The Alpaca model was fine-tuned on 52,000 prompts paired with responses from GPT-4, "a leading model from OpenAI that was trained to follow instructions and produce responses preferred by human raters." The human raters who shaped GPT-4's instruction-following behavior are doing exactly the domain-expert annotation described in the article, and the quality of those human judgments propagates through to every model trained on GPT-4's outputs.
Widdows & Cohen, §5.2.3.
Sample Size and Statistical Power
[M] Cluster-level evaluation in instruction-tuning datasets
For context on the scale of human evaluation in practice: Jurafsky and Martin note in SLP3 §9.1.2 that instruction-tuning datasets contain enormous numbers of tasks (SuperNaturalInstructions has 1,600 tasks across 76 clusters). Evaluating instruction-tuned models uses leave-one-out at the cluster level rather than the individual task level, because overlapping tasks within a cluster would inflate performance estimates. This methodological rigor in evaluation design carries over to calibration sample sizes: 100 to 200 examples is the practical floor, but those examples must be carefully stratified to avoid within-cluster contamination.
SLP3 §9.1.2. Read SLP3
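Leave-one-out at the cluster level means every task in the held-out cluster is removed from training together. A minimal sketch of the split logic follows; the task and cluster names are hypothetical placeholders rather than actual SuperNaturalInstructions entries.

```python
from collections import defaultdict


def leave_one_cluster_out(task_to_cluster):
    """Yield (held_out_cluster, train_tasks, test_tasks) for each cluster.

    All tasks in the held-out cluster go to the test side, so no closely
    related task can leak into training.
    """
    clusters = defaultdict(list)
    for task, cluster in task_to_cluster.items():
        clusters[cluster].append(task)
    for held_out, test_tasks in clusters.items():
        train_tasks = [
            task
            for cluster, tasks in clusters.items()
            if cluster != held_out
            for task in tasks
        ]
        yield held_out, train_tasks, test_tasks


# Hypothetical task-to-cluster assignment for illustration.
task_to_cluster = {
    "sentiment_reviews": "classification",
    "sentiment_tweets": "classification",
    "news_summary": "summarization",
    "dialogue_summary": "summarization",
    "date_extraction": "extraction",
}
splits = list(leave_one_cluster_out(task_to_cluster))
for held_out, train_tasks, test_tasks in splits:
    print(held_out, "train:", len(train_tasks), "test:", len(test_tasks))
```

A task-level leave-one-out on the same data would train on "sentiment_tweets" while testing on "sentiment_reviews", which is the kind of overlap that inflates performance estimates.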
The Cost Problem
[N] No single evaluation method is sufficient
Raschka confronts the evaluation challenge directly: unlike classification accuracy, instruction quality is subjective. His book evaluates with benchmarks (MMLU), human review, and automated LLM scoring, concluding that no single method is sufficient. This mirrors the hybrid approach recommended in the article: automated layers for breadth, human layers for depth, and a tiered workflow that routes the right examples to the right reviewer.
Raschka, Ch. 7.
[O] The Tier 1/2/3 pattern in a real clinical trial
Widdows and Cohen describe a real-world example of this tiered approach in Ch. 6.2. In a clinical trial of LLM-based cognitive behavioral therapy, investigators "scrupulously reviewed all interactions with the system," providing safety outreach for fifteen incidents involving self-harm risk and corrections in thirteen cases of out-of-scope medical advice. They note this level of oversight "would likely be difficult to scale outside the context of a clinical trial," but that the characterization of interactions requiring intervention could inform "the development of automated guardrails ... and the development of semi-automated pipelines in which a classifier flags potentially concerning posts for review." This is the Tier 1 / Tier 2 / Tier 3 pattern in the wild.
Widdows & Cohen, §6.2.
Combining Human and Automated Evaluation
[4] Validating the validators
Shankar et al. name the alignment gap between automated and human evaluation in their title: who validates the validators? The paper proposes a methodology for systematically aligning LLM-assisted evaluation of LLM outputs with human preferences, including the practice of holding out human-graded examples for measuring judge-human agreement and iterating on rubrics until calibration metrics cross practitioner thresholds.
Shankar et al. (2024). arXiv:2404.12272
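The held-out practice can be sketched as a split followed by an agreement measurement. This is an illustration under assumptions, not the tooling from Shankar et al.: the 30 percent holdout, the record fields, and the keyword-based toy judge are all hypothetical stand-ins for a real LLM judge and real graded data.

```python
import random


def split_graded_examples(examples, holdout_fraction=0.3, seed=0):
    """Shuffle human-graded examples and split into (dev, held_out).

    Rubrics are iterated on the dev set only; the held-out set is touched
    only to measure judge-human agreement.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def judge_human_agreement(examples, judge):
    """Fraction of examples where the judge's label equals the human label."""
    if not examples:
        raise ValueError("no examples to score")
    matches = sum(judge(e["output"]) == e["human_label"] for e in examples)
    return matches / len(examples)


def toy_judge(output):
    """Stand-in for an LLM judge: passes outputs that mention a source."""
    return "pass" if "source" in output else "fail"


# Ten human-graded examples; the human disagrees with the toy judge on one.
graded = [
    {
        "output": f"answer {i} with source" if i % 2 == 0 else f"answer {i}",
        "human_label": "pass" if i % 2 == 0 else "fail",
    }
    for i in range(10)
]
graded[1]["human_label"] = "pass"

dev, held_out = split_graded_examples(graded)
print("held-out agreement:", judge_human_agreement(held_out, toy_judge))
```

Raw agreement is used here for brevity; a chance-corrected statistic is the more defensible number to report.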
Tools and Platforms
[P] Three sources of preference data, three tool shapes
Jurafsky and Martin describe the three sources of preference data in SLP3 §9.2.1: direct human annotation, implicit web judgments (Reddit, StackExchange votes), and fully synthetic collection using LLMs as annotators (for example UltraFeedback using GPT-4). Tools like Argilla are designed to support all three workflows. The existence of implicit preference data from platforms where accumulated user votes impose rankings on outputs (SLP3 Fig. 9.6) suggests that production systems can harvest evaluation signals from their own user interactions, not just from dedicated annotation campaigns.
SLP3 §9.2.1. Read SLP3
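One way accumulated votes might be turned into preference pairs is sketched below. This is an assumption-laden illustration rather than a method described in SLP3: the function name, the "chosen"/"rejected" field names, and the minimum vote margin are all hypothetical.

```python
from itertools import combinations


def pairs_from_votes(responses, min_margin=1):
    """Convert vote counts on responses to one prompt into preference pairs.

    responses:  list of (text, vote_count) tuples.
    min_margin: smallest vote gap treated as a real preference (must be >= 1,
                so ties never produce a pair).
    """
    if min_margin < 1:
        raise ValueError("min_margin must be at least 1")
    pairs = []
    for (text_a, votes_a), (text_b, votes_b) in combinations(responses, 2):
        if abs(votes_a - votes_b) < min_margin:
            continue  # gap too small to count as a preference
        if votes_a > votes_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Three answers to the same question with their net votes (illustrative).
answers = [("answer A", 10), ("answer B", 3), ("answer C", 9)]
vote_pairs = pairs_from_votes(answers, min_margin=2)
print(vote_pairs)
```

The margin filter is a crude noise guard. Vote counts also reflect exposure and posting time, so pairs harvested this way are a weaker signal than dedicated annotation.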
Common Mistakes
[Q] Why surface metrics break on open-ended tasks
Alammar and Grootendorst discuss precision, recall, and F1 in the context of open-ended generation and note that those automated metrics can measure surface properties but not the dimensions that matter most: helpfulness, safety, appropriateness. Many common annotation mistakes stem from trying to make human evaluation as mechanical as automated metrics, which collapses the very dimensions human evaluation was supposed to capture.
Alammar & Grootendorst, Ch. 4.
[R] Reliability is not validity, and maker's bias is real
Widdows and Cohen describe a version of this reliability-validity gap in Ch. 1.4. They warn about a "maker's bias" in machine learning: as engineers, we want our models to be valid and valuable, which "can make us eager to believe that the world is more like the simple situation ... rather than the muddle." They go further: "a preference for certainty and simplicity is well-known in many human areas: in machine learning, this can manifest itself as (literally) a bias in favor of more biased models." The same dynamic applies to annotation: guidelines that produce high agreement may be oversimplifying the evaluation task.
Widdows & Cohen, §1.4.
[S] Data contamination and the limits of automated benchmarks
Jurafsky and Martin raise a structural problem with automated benchmarks in SLP3 §7.6.2: data contamination. Since LLMs train on web data and benchmarks like MMLU are on the web, models may incorporate some benchmark questions into their training, which overstates performance. They also note in §7.6.3 that accuracy alone is insufficient; evaluation must consider energy use, fairness (StereoSet, RealToxicityPrompts, BBQ), and model size. These structural limitations are another reason human evaluation remains irreplaceable: it cannot be contaminated by training-data overlap, and it can naturally assess dimensions that no single benchmark captures.
SLP3 §7.6.2 and §7.6.3. Read SLP3