Sources

Grounding, citations, and further reading for Measuring What RAG Actually Produces.

All of this is optional. The article itself is the tutorial. This page exists for readers who want to follow the citation trail back to the primary sources, see the textbook grounding for each claim, and read deeper into the underlying literature.

Nothing on this page is required reading, and you do not need to purchase any of these books. The numbered references in the article hyperlink to the corresponding entries here, so you can jump in at the point of interest and follow the back-to-article link to return.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 11 covers question answering and information retrieval, including the formal definitions of precision, recall, MAP, and the QA evaluation framework that the RAGAS metrics build on.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

An accessible, mathematically grounded survey of LLM architecture and behavior. Section 5.3.3 frames RAG as a "computational compromise" and discusses where the grounding promise of RAG breaks down. Section 6.1 covers hallucination, confabulation, and the historical separation between fact stores and language models.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey. Chapter 8 covers RAG end-to-end, including grounding through retrieved documents with citations. Chapter 12 discusses evaluation across the broader LLM stack and the inherent difficulty of single-metric judgments.

Es et al.: RAGAS paper

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). arXiv:2309.15217.

The paper that named and codified the RAGAS framework. Defines the four canonical metrics (faithfulness, answer relevance, context precision, context recall) and the LLM-as-judge approach that makes reference-free evaluation practical. Available at arxiv.org/abs/2309.15217.
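As a rough illustration of how the faithfulness metric reduces to arithmetic once the judge has done its work, the sketch below scores an answer as the fraction of its extracted claims that the retrieved context supports. The `ClaimVerdict` type, the `faithfulness_score` function, and the hand-written verdicts are hypothetical placeholders, not part of the RAGAS library; in a real pipeline an LLM judge would produce the verdicts, and returning 0.0 for an answer with no claims is an assumption made here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ClaimVerdict:
    """One claim extracted from the answer, plus the judge's verdict (hypothetical type)."""
    claim: str
    supported: bool  # True if the retrieved context supports the claim


def faithfulness_score(verdicts: Sequence[ClaimVerdict]) -> float:
    """Fraction of claims supported by the context.

    Assumption: an answer with no extracted claims scores 0.0.
    """
    if not verdicts:
        return 0.0
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


# Hand-written verdicts standing in for LLM-judge output (illustrative only).
verdicts = [
    ClaimVerdict("Dropout randomly omits units during training.", True),
    ClaimVerdict("The dropout paper appeared in JMLR in 2014.", True),
    ClaimVerdict("Dropout removes the need for validation data.", False),
]

score = faithfulness_score(verdicts)
print(round(score, 3))
```

Two of the three claims are supported, so the score is 2/3. The metric is only as reliable as the claim extraction and verdict steps that feed it.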

Zheng et al.: Judging LLM-as-a-Judge

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). NeurIPS 2023. arXiv:2306.05685.

The empirical anchor for the LLM-as-judge paradigm. Measures agreement between strong models and human evaluators across MT-Bench and Chatbot Arena, finding that GPT-4-class models agree with humans at roughly the same rate as humans agree with each other. Available at arxiv.org/abs/2306.05685.

Liu et al.: Lost in the Middle

Liu, N. F., Lin, K., Hewitt, J., et al. (2023). arXiv:2307.03172.

Documents the "lost in the middle" effect: language models extract information from the beginning and end of a long context more reliably than from the middle. This is the motivation for caring about context precision even when Recall@K is high, because a relevant chunk that was retrieved but sits mid-context may still go unused. Available at arxiv.org/abs/2307.03172.
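To make the Recall@K point concrete, here is a minimal sketch that compares two rankings of the same retrieved chunks. Both function names and both relevance lists are hypothetical, and `rank_weighted_precision` is an average-precision-style stand-in written for this illustration, not the RAGAS implementation. Note that it rewards relevant chunks near the top only; it does not model the end-of-context advantage the paper also reports.

```python
from typing import Sequence


def recall_at_k(relevance: Sequence[int], total_relevant: int) -> float:
    """Fraction of all relevant chunks that appear anywhere in the top-K list."""
    if total_relevant == 0:
        return 0.0
    return sum(relevance) / total_relevant


def rank_weighted_precision(relevance: Sequence[int]) -> float:
    """Mean of Precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0


# 1 = relevant chunk, 0 = irrelevant chunk, listed in ranked order (K = 6).
relevant_on_top = [1, 1, 0, 0, 0, 0]
relevant_in_middle = [0, 0, 1, 1, 0, 0]

for name, ranking in [("top", relevant_on_top), ("middle", relevant_in_middle)]:
    recall = recall_at_k(ranking, total_relevant=2)
    precision = rank_weighted_precision(ranking)
    print(name, recall, round(precision, 3))
```

Both rankings reach a Recall@K of 1.0, but the rank-aware score falls from 1.0 to about 0.417 when the relevant chunks are pushed to the middle. A recall-only dashboard would hide that difference.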

Liu et al.: G-Eval

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). EMNLP 2023. arXiv:2303.16634.

Applies chain-of-thought prompting to LLM-as-judge evaluation. Shows that asking the judge to reason about criteria first, then assign scores, improves agreement with human raters versus a direct-scoring prompt. Available at arxiv.org/abs/2303.16634.

Lewis et al.: original RAG paper

Lewis, P., Perez, E., Piktus, A., et al. (2020). NeurIPS 2020. arXiv:2005.11401.

The 2020 paper that introduced retrieval-augmented generation as a named architecture. Establishes the two-stage retriever-plus-generator framing that all subsequent evaluation work builds on. Available at arxiv.org/abs/2005.11401.

Srivastava et al.: Dropout

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). JMLR 15(56): 1929-1958.

The dropout regularization paper, cited in the article's example code block as a concrete claim that a faithfulness checker would verify. Background reading for the worked example. Available at jmlr.org/papers/v15/srivastava14a.html.

Saad-Falcon et al.: ARES

Saad-Falcon, J., Khattab, O., Potts, C., & Zaharia, M. (2023). arXiv:2311.09476.

An alternative automated RAG evaluation framework. Useful comparative reading for teams choosing between RAGAS, ARES, and custom LLM-as-judge stacks. Available at arxiv.org/abs/2311.09476.

The Evaluation Gap

5. Lewis et al. on the original RAG architecture

Lewis and colleagues introduce retrieval-augmented generation as a named architecture in their 2020 paper. The work defines the two-stage retriever-plus-generator framing and demonstrates that grounding a generator in retrieved evidence improves performance on knowledge-intensive tasks. The two-stage decomposition is what makes the evaluation gap visible: each stage can be measured independently, and the failure modes that this article catalogs live in the seam between them.

Lewis et al. (2020), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
