Sources

Grounding, citations, and further reading for Measuring What RAG Actually Produces.

All of this is optional. The article itself is the tutorial. This page exists for readers who want to follow the citation trail back to the primary sources, see the textbook grounding for each claim, and read deeper into the underlying literature.

Nothing on this page is required reading, and you do not need to purchase any of these books. The numbered references in the article hyperlink to the corresponding entries here, so you can jump in at the point of interest and follow the back-to-article link to return.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 11 covers question answering and information retrieval, including the formal definitions of precision, recall, MAP, and the QA evaluation framework that the RAGAS metrics build on.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

An accessible and mathematically grounded survey of LLM architecture and behavior. Section 5.3.3 frames RAG as a "computational compromise" and discusses where the grounding promise of RAG breaks down. Section 6.1 covers hallucination, confabulation, and the historical separation between fact stores and language models.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey. Chapter 8 covers RAG end-to-end, including grounding through retrieved documents with citations. Chapter 12 discusses evaluation across the broader LLM stack and the inherent difficulty of single-metric judgments.

Es et al.: RAGAS paper

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). arXiv:2309.15217.

The paper that named and codified the RAGAS framework. It defines faithfulness, answer relevance, and context relevance as reference-free metrics and describes the LLM-as-judge approach that makes them practical; context precision and context recall, as discussed in the article, are metrics from the accompanying ragas library. Available at arxiv.org/abs/2309.15217.

Zheng et al.: Judging LLM-as-a-Judge

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). NeurIPS 2023. arXiv:2306.05685.

The empirical anchor for the LLM-as-judge paradigm. Measures agreement between strong models and human evaluators across MT-Bench and Chatbot Arena, finding that GPT-4-class models agree with humans at roughly the same rate as humans agree with each other. Available at arxiv.org/abs/2306.05685.

Liu et al.: Lost in the Middle

Liu, N. F., Lin, K., Hewitt, J., et al. (2023). arXiv:2307.03172.

Documents the "lost in the middle" effect: language models extract information from the beginning and end of a long context more reliably than from the middle. This is the motivation for caring about context precision even when Recall@K is high. Available at arxiv.org/abs/2307.03172.

Liu et al.: G-Eval

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). EMNLP 2023. arXiv:2303.16634.

Applies chain-of-thought prompting to LLM-as-judge evaluation. Shows that asking the judge to reason about criteria first, then assign scores, improves agreement with human raters versus a direct-scoring prompt. Available at arxiv.org/abs/2303.16634.

Lewis et al.: original RAG paper

Lewis, P., Perez, E., Piktus, A., et al. (2020). NeurIPS 2020. arXiv:2005.11401.

The 2020 paper that introduced retrieval-augmented generation as a named architecture. Establishes the two-stage retriever-plus-generator framing that all subsequent evaluation work builds on. Available at arxiv.org/abs/2005.11401.

Srivastava et al.: Dropout

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). JMLR 15(56): 1929-1958.

The dropout regularization paper, cited in the article's example code block as a concrete claim that a faithfulness checker would verify. Background reading for the worked example. Available at jmlr.org/papers/v15/srivastava14a.html.

Saad-Falcon et al.: ARES

Saad-Falcon, J., Khattab, O., Potts, C., & Zaharia, M. (2023). arXiv:2311.09476.

An alternative automated RAG evaluation framework. Useful comparative reading for teams choosing between RAGAS, ARES, and custom LLM-as-judge stacks. Available at arxiv.org/abs/2311.09476.

The Evaluation Gap

5. Lewis et al. on the original RAG architecture

Lewis and colleagues introduce retrieval-augmented generation as a named architecture in their 2020 paper. The work defines the two-stage retriever-plus-generator framing and demonstrates that grounding a generator in retrieved evidence improves performance on knowledge-intensive tasks. The two-stage decomposition is what makes the evaluation gap visible: each stage can be measured independently, and the failure modes that this article catalogs live in the seam between them.

Lewis et al. (2020), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401

↩ Back to article

8. The formal two-stage RAG architecture

Jurafsky and Martin formalize the two-stage RAG architecture in SLP3 §11.4. They define RAG as a system with a retriever component that finds relevant passages and a generator (the LLM) that produces an answer conditioned on those passages. RAG was introduced specifically to "mitigate the problem of hallucination." The evaluation gap described in this article, where retrieval succeeds but generation fails, is the gap between these two components. SLP3 §11.6 addresses QA evaluation separately, reinforcing that retrieval and answer quality require distinct metrics.

SLP3 §11.4, §11.6. Read SLP3

↩ Back to article

9. RAG as computational compromise

Widdows and Cohen describe RAG in Ch. 5.3.3 as a "computational compromise" and warn it is "easily misinterpreted": the use of domain-specific search results helps RAG produce more factual answers, "but it doesn't mean that these answers are produced directly from a database of established facts." This is exactly the gap RAGAS evaluation is designed to measure.

Widdows & Cohen, §5.3.3.

↩ Back to article

The RAGAS Framework

1. The RAGAS framework definition

Es, James, Espinosa-Anke, and Schockaert introduce RAGAS as a reference-free evaluation framework. The metrics discussed in the article (faithfulness, answer relevance, context precision, context recall) target distinct failure modes; the paper itself defines faithfulness, answer relevance, and context relevance, while context precision and context recall come from the accompanying ragas library. The framework uses an LLM-as-judge to compute its metrics without requiring human-annotated ground truth answers for every example. The motivation is operational: human annotation does not scale, and existing automatic metrics (BLEU, ROUGE) measure surface-form overlap rather than semantic correctness.

Es et al. (2023), RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217

↩ Back to article

11. Precision and recall, applied at the claim level

SLP3 §11.2 provides the formal definitions: precision = |R|/|T| (relevant documents returned divided by total returned) and recall = |R|/|U| (relevant returned divided by total relevant). RAGAS repurposes these concepts at the claim level rather than the document level. Context precision asks whether the retrieved chunks were useful (analogous to document-level precision); context recall asks whether all needed information was present (analogous to document-level recall). The mathematical structure is identical; the unit of analysis shifts from documents to information claims.
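
As a minimal sketch of how the same two formulas apply at either unit of analysis, the Python below computes set-based precision and recall. The function name `precision_recall` and the document and claim identifiers are hypothetical, and relevance is supplied here as hand-labeled sets; in RAGAS the relevance judgments would come from an LLM judge rather than from labels like these.

```python
def precision_recall(returned, relevant):
    """Set-based precision and recall over any hashable units."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # R: relevant items that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0  # |R| / |T|
    recall = len(hits) / len(relevant) if relevant else 0.0     # |R| / |U|
    return precision, recall


# Document level: hypothetical document IDs.
doc_p, doc_r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])

# Claim level: hypothetical claim labels, same arithmetic.
claim_p, claim_r = precision_recall(["c1", "c2"], ["c1", "c2", "c3"])

print(doc_p, doc_r)      # precision 2/4, recall 2/3
print(claim_p, claim_r)  # precision 2/2, recall 2/3
```

Only the contents of the two sets change between the document-level and claim-level calls, which is the sense in which the unit of analysis shifts while the structure stays the same.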

SLP3 §11.2. Read SLP3

↩ Back to article

14. Faithfulness as zero-shot natural language inference

The verification task has deep roots in NLP. SLP3 §10.4.2 describes natural language inference (NLI), also called recognizing textual entailment: given a premise and a hypothesis, classify whether the premise entails, contradicts, or is neutral toward the hypothesis. Faithfulness checking is structurally identical. The retrieved context is the premise; each generated claim is the hypothesis. An "entails" verdict means the claim is supported. NLI was one of the first tasks BERT-style models were fine-tuned on, using datasets like MultiNLI (Williams et al., 2018). LLM-as-judge faithfulness evaluation is, in effect, zero-shot NLI at scale.
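
The premise/hypothesis mapping can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RAGAS implementation: `faithfulness`, `toy_judge`, and the sample text are hypothetical, and the judge is a crude substring check standing in for a real NLI model or LLM judge.

```python
from typing import Callable, List

# An NLI verdict: "entails", "contradicts", or "neutral".
Label = str


def faithfulness(context: str, claims: List[str],
                 judge: Callable[[str, str], Label]) -> float:
    """Fraction of claims (hypotheses) the context (premise) entails."""
    if not claims:
        return 0.0  # assumption: an answer with no claims scores zero
    supported = sum(1 for claim in claims if judge(context, claim) == "entails")
    return supported / len(claims)


def toy_judge(premise: str, hypothesis: str) -> Label:
    """Hypothetical stand-in for an NLI model: literal substring match only."""
    return "entails" if hypothesis.lower() in premise.lower() else "neutral"


context = "The warranty lasts two years. Returns are accepted within 30 days."
claims = ["the warranty lasts two years", "shipping is free"]

score = faithfulness(context, claims, toy_judge)
print(score)  # one of the two claims is supported by the context
```

Swapping `toy_judge` for a model-backed function leaves the scoring logic untouched, which is why the judge is passed in as a parameter.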

SLP3 §10.4.2. Read SLP3

↩ Back to article

2. Empirical evidence for LLM-as-judge agreement

Zheng et al. evaluate LLM-as-judge across MT-Bench and Chatbot Arena, two benchmarks designed to test multi-turn instruction following and open-ended conversation. They report that GPT-4 achieves over 80% agreement with human evaluators, which is comparable to the inter-annotator agreement between trained humans. The paper also catalogs systematic biases (position bias, verbosity bias, self-enhancement bias) that judge LLMs exhibit, along with mitigations. RAGAS's claim-decomposition approach reduces the burden on the judge and partly sidesteps these biases.

Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685

↩ Back to article

Faithfulness: The Most Critical Metric

15. RAG's grounding promise and the role of faithfulness

Alammar and Grootendorst argue that RAG reduces hallucinations by grounding answers in retrieved documents with citations. Faithfulness is the metric that directly measures whether a RAG system is achieving this grounding, which makes it the most important validation of the RAG value proposition. A system whose faithfulness score is consistently below 1.0 is failing at the one thing RAG was designed to do.

Alammar & Grootendorst, Ch. 8.

↩ Back to article

16. Comparison with token-level F1 in QA

Compare the faithfulness formula with SLP3 §11.6, which defines token-level F1 for free-text QA evaluation. F1 computes precision and recall over the tokens in the predicted answer versus the gold answer, then takes their harmonic mean. RAGAS faithfulness operates at a higher level of abstraction: instead of token overlap, it checks claim-level support against context. Both approaches decompose the answer into units (tokens vs. claims) and measure what fraction is correct, but RAGAS captures semantic entailment where F1 captures only lexical overlap.
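
For contrast, here is a minimal sketch of token-level F1. It assumes lowercasing and whitespace tokenization, which is a simplification (standard QA scoring scripts typically also strip punctuation and articles); the function name `token_f1` and the sample strings are illustrative.

```python
from collections import Counter


def token_f1(predicted: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against a gold answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("the cat sat", "the cat sat down")  # 3 of 4 gold tokens
paraphrase = token_f1("feline rested", "the cat sat")  # no shared tokens

print(round(partial, 3), paraphrase)
```

The paraphrase scores zero despite meaning roughly the same thing, which illustrates the lexical-overlap limitation that claim-level entailment checking is meant to address.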

SLP3 §11.6. Read SLP3

↩ Back to article

17. Hallucination versus confabulation

Widdows and Cohen offer useful terminology in Ch. 6.1.1. They note that cognitive scientist Christopher Summerfield argues the behavior is closer to what in humans is called confabulation, "a much less alarming term than hallucination." They also cite the Galactica model, which generated plausible but fabricated scientific claims about Ivermectin, underscoring that "plausibility in and of itself can be persuasive." The faithfulness metric is effectively a confabulation detector.

Widdows & Cohen, §6.1.1.

↩ Back to article

18. Historical separation of facts and language

Widdows and Cohen provide an illuminating historical contrast in Ch. 6.1.1. They describe the traditional architecture where "factuality and fluency were separate responsibilities: a knowledge base could be designed to store facts ... A language model's job wouldn't be to store and recall facts accurately, but to turn assertions into text." RAG attempts to restore this separation, and faithfulness scoring is the metric that verifies whether the separation is actually holding.

Widdows & Cohen, §6.1.1.

↩ Back to article

19. Phantom references predate LLMs

Widdows and Cohen offer a wry tangent in Ch. 6.1.1: fabricated references are not unique to LLMs. They note that "a 'phantom reference' to an imaginary article was copied verbatim over 400 times by human authors, before chatbots were there to help!" This puts the faithfulness problem in perspective: while LLMs can confabulate at scale, the underlying failure mode of citing unsupported claims predates AI entirely. Faithfulness evaluation catches it regardless of origin.

Widdows & Cohen, §6.1.1.

↩ Back to article

Answer Relevance: Did You Actually Answer the Question?

20. Cosine similarity and the distributional hypothesis

The "semantic similarity" step in answer relevance relies on the same cosine similarity that SLP3 §5.4 defines as the normalized dot product between two vectors. Jurafsky and Martin explain that cosine measures directional similarity independent of vector magnitude, making it the standard metric for comparing text embeddings. RAGAS encodes each generated question and the original question using a sentence embedding model, then averages the cosine similarities. The entire mechanism depends on the distributional hypothesis (§5.2): questions about the same topic produce similar embeddings because they share distributional context.
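
The arithmetic behind this step can be sketched with plain Python lists. The two-dimensional vectors below are hypothetical stand-ins for sentence embeddings, and `answer_relevance` is an illustrative name for the averaging step, not a function from the ragas library.

```python
import math


def cosine(u, v):
    """Normalized dot product: direction matters, magnitude does not."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def answer_relevance(original_vec, generated_vecs):
    """Mean cosine similarity between the original and generated questions."""
    if not generated_vecs:
        return 0.0
    return sum(cosine(original_vec, g) for g in generated_vecs) / len(generated_vecs)


original = [1.0, 0.0]                  # embedding of the user's question
generated = [[2.0, 0.0], [0.0, 3.0]]   # one on-topic, one orthogonal

relevance = answer_relevance(original, generated)
print(relevance)  # average of one perfect match and one orthogonal vector
```

Scaling a vector leaves its cosine with any other vector unchanged, so `[2.0, 0.0]` still matches `[1.0, 0.0]` perfectly.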

SLP3 §5.4, §5.2. Read SLP3

↩ Back to article

Context Precision and Context Recall

3. The lost-in-the-middle effect

Liu, Lin, Hewitt, and colleagues benchmark how language models use long contexts and find a U-shaped performance curve: models extract information placed at the beginning or end of a context window more reliably than information placed in the middle. The implication for RAG systems is direct. A retriever that returns ten chunks with the relevant one in position six scores perfectly on Recall@K, yet the model is more likely to miss the relevant information than if that chunk had come first or last. Context precision is the RAGAS metric that catches this: a clean, focused context outperforms a noisy one even when Recall@K is identical.
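
The blind spot in Recall@K can be shown with a short sketch. The chunk IDs are hypothetical, and `relevant_share` is a deliberately crude proxy (the fraction of retrieved chunks that are relevant), not the RAGAS context precision formula.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant chunks found anywhere in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def relevant_share(ranked_ids, relevant_ids):
    """Crude noise proxy: what fraction of the retrieved chunks is relevant."""
    relevant = set(relevant_ids)
    if not ranked_ids:
        return 0.0
    return sum(1 for chunk in ranked_ids if chunk in relevant) / len(ranked_ids)


relevant = ["chunk6"]
noisy = [f"chunk{i}" for i in range(1, 11)]  # relevant chunk buried mid-context
focused = ["chunk6", "chunk2"]               # relevant chunk first, little noise

noisy_recall = recall_at_k(noisy, relevant, k=10)
focused_recall = recall_at_k(focused, relevant, k=10)
noisy_position = noisy.index("chunk6") + 1
noisy_share = relevant_share(noisy, relevant)
focused_share = relevant_share(focused, relevant)

print(noisy_recall, focused_recall)  # identical recall
print(noisy_position, noisy_share, focused_share)
```

Both contexts earn the same Recall@K, but only the second measure distinguishes the focused context from the noisy one.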

Liu et al. (2023), Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172

↩ Back to article

21. Contextual embeddings and information-level recall

The distinction between document-level and information-level recall reflects what SLP3 §10.3 calls the advantage of contextual embeddings over static representations. Where word2vec assigns a single vector per word regardless of context (§5.5), BERT-style models produce different vectors for the same word in different contexts. This means two documents can express the same information in completely different vocabulary and still be recognized as semantically equivalent by a contextual model. Context recall benefits from this property: it measures information coverage rather than document identity, which only works because contextual embeddings capture meaning, not just surface form.

SLP3 §10.3, §5.5. Read SLP3

↩ Back to article

Implementing RAGAS Evaluation

6. Srivastava et al.: Dropout as the worked-example claim

The manual faithfulness checker uses a claim about Python 3.12 introducing dropout-like features as a worked example. The underlying dropout regularization technique was introduced by Srivastava and colleagues in their 2014 JMLR paper. The original paper is background reading for the worked example, not a load-bearing source for the article's argument, but it grounds the kind of factual claim the checker is asked to verify.

Srivastava et al. (2014), Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15(56)

↩ Back to article

22. The verify-each-claim step as zero-shot NLI

The "verify each claim" step in the manual implementation is structurally identical to the NLI task described in SLP3 §10.4.2. Jurafsky and Martin define NLI as classifying whether a premise entails, contradicts, or is neutral toward a hypothesis, illustrated with labeled premise/hypothesis pairs from the MultiNLI dataset; for example, the premise "Jon walked back to the town to the smithy" paired with the hypothesis "Jon traveled back to his hometown" is labeled neutral, because the premise neither confirms nor rules out the hypothesis. BERT models were originally fine-tuned on NLI by passing premise/hypothesis pairs through the encoder and classifying the [CLS] token output. The LLM-based faithfulness checker does the same task zero-shot: context is the premise, claim is the hypothesis, and "supported" maps to "entails."

SLP3 §10.4.2. Read SLP3

↩ Back to article

Beyond RAGAS: Other Evaluation Approaches

7. Saad-Falcon et al.: ARES as an alternative framework

Saad-Falcon, Khattab, Potts, and Zaharia introduce ARES as a competing automated RAG evaluation framework. ARES uses a synthetic data generator to create question-answer-context triples, then trains lightweight judge models on those triples to evaluate target RAG systems. The two frameworks emphasize different tradeoffs: RAGAS is reference-free at evaluation time, whereas ARES requires training a judge up front but then runs cheaper inference per evaluation.

Saad-Falcon et al. (2023), ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv:2311.09476

↩ Back to article

4. G-Eval and chain-of-thought scoring

Liu, Iter, Xu, and colleagues introduce G-Eval as a chain-of-thought variant of LLM-as-judge. The judge is prompted to first reason about the evaluation criteria in natural language, then assign a numerical score. The reasoning step constrains the score in a way that direct scoring does not, and the paper reports higher agreement with human raters across summarization quality dimensions. G-Eval is the conceptual ancestor of many custom LLM-as-judge stacks deployed in production today.

Liu et al. (2023), G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634

↩ Back to article

23. QA benchmarks and inter-annotator agreement

SLP3 §11.5 describes how standard QA benchmark datasets like Natural Questions and MS MARCO were constructed with human annotation. Natural Questions used Google queries with Wikipedia-derived answers; MS MARCO used Bing queries with human-written answers. The quality of these benchmarks depends on the same inter-annotator agreement issues this article raises. Jurafsky and Martin also note that TyDi QA extends this methodology across 11 languages, where annotation disagreement can reflect genuine cultural differences in what counts as a "correct" answer, not just rubric ambiguity.

SLP3 §11.5. Read SLP3

↩ Back to article

24. Exact match versus token-level F1

SLP3 §11.6 distinguishes two evaluation paradigms based on answer format. For multiple-choice QA (like MMLU), exact match accuracy suffices: the system either picks the right option or it does not. For free-text QA, exact match is too strict because many valid answers differ in wording, so token-level F1 is used instead. Golden test suites for RAG systems face the same tension. Simple factual questions ("What is the max payload?") may be amenable to exact match, while complex explanatory questions require the kind of semantic evaluation RAGAS provides. The test suite should include both types.
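
A minimal sketch of an exact-match check for the simple factual end of a golden test suite follows. The normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are an assumption, and the "25 kg" payload value is a made-up example, not a figure from the article.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(predicted: str, gold: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize(predicted) == normalize(gold)


terse = exact_match("25 kg.", "25 KG")                           # same answer
verbose = exact_match("The maximum payload is 25 kg", "25 kg")   # correct but wordy

print(terse, verbose)
```

The verbose answer is correct yet fails exact match, which is the strictness problem that pushes free-text evaluation toward token-level F1 or semantic scoring.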

SLP3 §11.6. Read SLP3

↩ Back to article

Building an Evaluation Pipeline

25. TREC and the shared-task tradition

Widdows and Cohen describe the historical precedent for the CI/CD evaluation pattern in Ch. 2.3.3: the TREC (Text REtrieval Conference) methodology, which began in 1992. They explain that "designing a conference around a shared task, with a dataset, known results, and agreed evaluation metrics" made it possible to compare systems on demonstrated results rather than assumptions. CI/CD evaluation pipelines for RAG are the modern descendant of this approach: automated, repeatable, and anchored to agreed-upon metrics.

Widdows & Cohen, §2.3.3.

↩ Back to article

Common Pitfalls in RAG Evaluation

26. Multi-dimensional evaluation

Alammar and Grootendorst acknowledge that evaluation remains challenging and that no single metric works for all use cases. This reinforces the multi-dimensional approach advocated here: combining RAGAS metrics, human evaluation, and production monitoring rather than relying on any one measure.

Alammar & Grootendorst, Ch. 12.

↩ Back to article

27. Mean Average Precision as the retriever-side complement

SLP3 §11.2 introduces Mean Average Precision (MAP) as the standard single-number summary of retrieval quality. For each query, average precision is the mean of the precision values measured at each rank where a relevant document appears; MAP is the mean of those per-query values across all queries. This is the metric Jurafsky and Martin recommend for comparing retrieval systems, and it complements the RAGAS metrics discussed here. A useful diagnostic practice: compute MAP on the retriever independently, then compute RAGAS faithfulness on the full pipeline. If MAP is high but faithfulness is low, the problem is in the generation stage. If MAP is low, fix retrieval first, since RAGAS metrics will be noisy when the context is poor.
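
The retriever-side half of that diagnostic can be sketched as follows. The document IDs are hypothetical, and the normalization is an assumption: this version divides by the total number of relevant documents, a common convention, whereas some formulations divide by the number of relevant documents actually retrieved.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by all relevant docs, retrieved or not.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of per-query average precision over (ranking, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], ["a", "c"]),  # hits at ranks 1 and 3
    (["x", "y"], ["y"]),            # hit at rank 2
]
map_score = mean_average_precision(runs)
print(round(map_score, 3))
```

Because the score depends only on the ranked IDs and the relevance labels, it can be computed on the retriever alone, before any generation step runs.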

SLP3 §11.2. Read SLP3

↩ Back to article

28. Maker's bias and the cost of fixed eval sets

Widdows and Cohen describe a related cognitive trap in Ch. 1.4: "maker's bias." They write that "as machine learning engineers, we want our models to be valid and valuable. This can make us eager to believe that the world is more like the simple situation on the left, rather than the muddle on the right." The same bias applies to RAG evaluation: tuning to a fixed eval set feels productive, but the real world is always messier than the benchmark.

Widdows & Cohen, §1.4.

↩ Back to article

29. RAG marketing claims and the role of negative tests

Widdows and Cohen explicitly critique RAG marketing in Ch. 5.3.3, noting that RAG products use phrases like "It references an authoritative knowledge base outside of its training data sources before generating a response," which is "correct, in a sense, but the way RAG queries such a knowledge base doesn't constrain it to produce only sentences that are equally authoritative." Negative test cases are precisely how a team verifies whether a RAG system lives up to these marketing promises.

Widdows & Cohen, §5.3.3.

↩ Back to article

Putting It All Together

30. Masked language modeling under the evaluation stack

SLP3 §10.2 describes the masked language modeling training objective that underlies the BERT family of models. The model learns to predict masked tokens from bidirectional context, a form of self-supervision that requires no labeled data. This is the training paradigm that produced the encoder models underlying many RAG evaluation components: the sentence embedding models used for answer relevance scoring, the cross-encoders used for reranking, and the NLI models that inspired faithfulness checking. The entire RAG evaluation stack, from RAGAS metrics to custom LLM judges, rests on foundations that SLP3 traces to the masked language modeling objective.

SLP3 §10.2. Read SLP3

↩ Back to article