Sources

Grounding, citations, and further reading for Human Evaluation Frameworks for LLM Systems.

All of this is optional: the article itself is the tutorial. This page is for readers who want to follow the citation trail back to the primary sources and read deeper into the annotation-methodology literature.

You do not need to purchase any of these books. Numbered references in the article hyperlink to the corresponding entries here.

About the Sources

Krippendorff: Content Analysis (anchor reference)

Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). Sage Publications.

The definitive reference on reliability metrics for content analysis, including the alpha coefficient that bears Krippendorff's name. The text walks through the assumptions behind every standard agreement metric, explains why kappa fails on imbalanced label distributions, and grounds the design choices any serious human-evaluation pipeline has to make.

Snow et al.: Cheap and Fast, But Is It Good?

Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). EMNLP 2008.

The empirical foundation for the modern crowd-annotation workflow. Demonstrates that aggregated crowdworker annotations can match expert quality on many NLP tasks when quality control is in place. Available at aclanthology.org/D08-1027.
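As a rough illustration of what "aggregated crowdworker annotations" means in practice, the sketch below combines several non-expert labels per item by majority vote and compares the result with an expert label. Majority voting is an assumed, deliberately simple aggregation rule chosen for illustration, and the item IDs, labels, and the `majority_vote` helper are all hypothetical rather than taken from the paper.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among one item's annotations."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]

# Hypothetical data: item id -> labels from five crowdworkers.
crowd_labels = {
    "item-1": ["pos", "pos", "neg", "pos", "pos"],
    "item-2": ["neg", "neg", "pos", "neg", "neg"],
    "item-3": ["pos", "neg", "pos", "neg", "pos"],
}

# Hypothetical expert labels used as the reference standard.
expert_labels = {"item-1": "pos", "item-2": "neg", "item-3": "pos"}

# Collapse each item's crowd labels into a single aggregated label.
aggregated = {item: majority_vote(votes) for item, votes in crowd_labels.items()}

# Fraction of items where the aggregated crowd label matches the expert.
agreement = sum(
    aggregated[item] == expert_labels[item] for item in expert_labels
) / len(expert_labels)

print(aggregated)
print(f"crowd-vs-expert agreement: {agreement:.2f}")
```

An odd number of annotators per item avoids ties on binary labels. With an even number or more label classes, a tie-breaking policy would need to be chosen explicitly.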

Zheng et al.: Judging LLM-as-a-Judge

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). NeurIPS 2023. arXiv:2306.05685.

Introduces the LLM-as-judge paradigm and compares automated judge performance to human evaluation, establishing the benchmarks for judge-human agreement that frame the calibration discussion in this article. Available at arxiv.org/abs/2306.05685.

Shankar et al.: Who Validates the Validators?

Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G., & Arawjo, I. (2024). arXiv:2404.12272.

Examines the alignment gap between automated and human evaluation and proposes a methodology for systematic validation of automated judges. The paper articulates why human evaluation cannot be fully replaced by automated judges, even sophisticated ones. Available at arxiv.org/abs/2404.12272.

Hovy & Lavid: corpus annotation methodology

Hovy, E. & Lavid, J. (2010). International Journal of Translation, 22(1).

Foundational paper on annotation methodology. Lays out the case that annotation is itself a science, emphasizing the importance of guidelines, training, and reliability measurement as load-bearing artifacts rather than overhead. Available at link.springer.com.

Artstein & Poesio: inter-coder agreement survey

Artstein, R. & Poesio, M. (2008). Computational Linguistics 34(4), 555-596.

Comprehensive survey of agreement metrics in computational linguistics, their assumptions, and appropriate use cases. The canonical reference when deciding which metric to compute for which annotation task. Available at doi.org/10.1162/coli.07-034-R2.

Cohen: the original kappa paper

Cohen, J. (1960). Educational and Psychological Measurement, 20(1), 37-46.

The original paper introducing Cohen's kappa, the most widely used pairwise agreement metric. Worth reading once for the historical context, even though Krippendorff (above) is the more practical day-to-day reference.
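For readers who have not seen the definition, kappa compares observed agreement with the agreement two raters would reach by chance given their own label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal from-scratch implementation under that standard definition. The `cohens_kappa` helper and both label sets are hypothetical, constructed only to show how the same 90% raw agreement yields very different kappa values on balanced and imbalanced label distributions.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)

# Hypothetical balanced case: 50/50 label split, raters agree on 90 of 100 items.
balanced_a = ["ok"] * 45 + ["bad"] * 45 + ["ok"] * 5 + ["bad"] * 5
balanced_b = ["ok"] * 45 + ["bad"] * 45 + ["bad"] * 5 + ["ok"] * 5

# Hypothetical imbalanced case: 95/5 label split, raters again agree on 90 of 100.
imbalanced_a = ["ok"] * 90 + ["bad"] * 5 + ["ok"] * 5
imbalanced_b = ["ok"] * 90 + ["ok"] * 5 + ["bad"] * 5

print(round(cohens_kappa(balanced_a, balanced_b), 2))      # about 0.80
print(round(cohens_kappa(imbalanced_a, imbalanced_b), 2))  # about -0.05
```

In the imbalanced case chance agreement is already 0.905, so 90% raw agreement scores slightly below chance. This is the behavior the Krippendorff entry above refers to when it says kappa fails on imbalanced label distributions.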

SLP3: Jurafsky & Martin

Jurafsky, D. & Martin, J. H. Speech and Language Processing (3rd ed., draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 9 is the canonical formal treatment of preference data, reward modeling, and the Bradley-Terry mathematics behind pairwise ranking; Chapter 7 covers data contamination and benchmark methodology. Cited several times below.
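As a quick orientation to the Bradley-Terry mathematics mentioned above, the model assigns each candidate a latent score and sets the probability that one is preferred over another to the logistic function of the score difference. The sketch below shows that formula and the per-pair negative log-likelihood commonly used to fit such scores. The function names and example scores are hypothetical and are not drawn from the textbook.

```python
import math

def bradley_terry_prob(score_i, score_j):
    """P(i is preferred over j) = exp(s_i) / (exp(s_i) + exp(s_j)),
    written in the equivalent logistic form sigmoid(s_i - s_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def pairwise_nll(score_chosen, score_rejected):
    """Negative log-likelihood of one observed human preference."""
    return -math.log(bradley_terry_prob(score_chosen, score_rejected))

# Hypothetical latent scores for two responses to the same prompt.
score_a, score_b = 1.2, 0.2

p_a_over_b = bradley_terry_prob(score_a, score_b)
p_b_over_a = bradley_terry_prob(score_b, score_a)

print(f"P(A preferred) = {p_a_over_b:.3f}")
print(f"P(B preferred) = {p_b_over_a:.3f}")
print(f"loss if the annotator chose A: {pairwise_nll(score_a, score_b):.3f}")
print(f"loss if the annotator chose B: {pairwise_nll(score_b, score_a):.3f}")
```

Only the difference between scores matters, so equal scores give a probability of 0.5, and the loss is larger when the annotator's choice contradicts the current scores.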

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, D. & Cohen, T. (2025). SemanticVectors Publishing.

Mathematically grounded survey of LLM architecture and behavior. Strong on the historical evaluation traditions (Cranfield, TREC) that human-evaluation pipelines inherit, and on the safety concerns (hallucination, sycophancy, clinical-trial-grade oversight) that make human evaluation non-negotiable in high-stakes domains. Cited several times below.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, J. & Grootendorst, M. (2024). O'Reilly Media.

Practitioner-oriented survey. Useful for the discussion of DPO preference tuning and for the recommendation that benchmarks, LLM-as-judge, and human evaluation be combined as complementary signals rather than treated as substitutes.

The Irreducible Role of Human Judgment

A. Cranfield, TREC, and the Cleverdon tradition

Widdows and Cohen describe how information retrieval was guided by quantitative evaluation from the 1960s onward. Cleverdon's team at Cranfield University had human experts comb document collections to create relevance judgments as ground truth, which then became the standard for comparing automated retrieval systems. The pattern is the same one this article describes: humans define what "correct" looks like, and automated systems are measured against that human baseline.

Widdows & Cohen, §2.3.3 ("Search Evaluation").
