Sources

Grounding, citations, and further reading for The Retrieval Quality Problem.

All of this is optional. These are the sources behind the article. Nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper knows where to look.

References

1. Robertson, S. & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 3(4), 333–389.

2. Gao, L., Ma, X., Lin, J., & Callan, J. (2022). "Precise Zero-Shot Dense Retrieval without Relevance Labels." arXiv:2212.10496.

3. Nogueira, R. & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085.

4. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.

5. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.

6. Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009.

7. Zheng, H. S., et al. (2023). "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." arXiv:2310.06117.

Introduction

8. Grounding note

Widdows and Cohen provide useful context here. In Ch. 5.3.3, they describe RAG as "very much a computational compromise" and explicitly warn that it is "easily misinterpreted." They note that while RAG "produce[s] more factual answers than the initial language model," it does not mean answers "are produced directly from a database of established facts." This aligns with the article's framing that retrieval failures, not generation failures, are the root cause. Widdows & Cohen, Issue #45

9. Grounding note

Jurafsky and Martin formalize the RAG architecture in SLP3 §11.4 as a two-component system: a retriever that finds relevant passages and a generator (the LLM) that produces an answer conditioned on those passages. They note that RAG was introduced specifically to "mitigate the problem of hallucination" by grounding generation in retrieved evidence. The critical architectural insight is that the retriever's failures propagate directly into generation failures, exactly the pattern this article describes. SLP3 §11.4

Precision and Recall: The Fundamental Tension

10. Grounding note

Jurafsky and Martin give the formal definitions in SLP3 §11.2. Let T be the set of documents returned by the system, U be the set of all relevant documents, and R = T ∩ U (the relevant documents that were returned). Then precision = |R|/|T| and recall = |R|/|U|. This notation makes the tradeoff arithmetic: increasing |T| to capture more of U (higher recall) also admits more non-relevant documents (lower precision). The article's intuitive framing maps directly onto these formal relationships. SLP3 §11.2
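The definitions above translate almost directly into set operations. The following is a minimal sketch, assuming documents are identified by string IDs; the ranking and the relevant set are made-up illustrations, not data from the article or from SLP3.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    T = returned documents, U = all relevant documents, R = T ∩ U.
    """
    T = set(returned)
    U = set(relevant)
    R = T & U
    precision = len(R) / len(T) if T else 0.0
    recall = len(R) / len(U) if U else 0.0
    return precision, recall


# Hypothetical ranked output and relevance judgments for a single query.
ranking = ["d1", "d7", "d2", "d8", "d9", "d3", "d10", "d4"]
relevant = {"d1", "d2", "d3", "d4"}

small = precision_recall(ranking[:3], relevant)  # |T| = 3
large = precision_recall(ranking[:8], relevant)  # |T| = 8
```

With three documents returned, two are relevant, so precision is 2/3 and recall is 1/2. Returning all eight raises recall to 1 but lowers precision to 1/2, which is the tradeoff the note describes.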

11. Grounding note

Widdows and Cohen trace precision and recall back to the Cranfield experiments of the early 1960s (Ch. 2.3.3). Cyril Cleverdon's team established these as the canonical IR evaluation metrics, defining precision as "the proportion of documents retrieved that are actually relevant" and recall as "the proportion of documents that are relevant that are actually retrieved." They note the inherent tension: "making some errors go away makes others more likely." This historical lineage directly supports the article's claim that these tradeoffs have been studied for decades. Widdows & Cohen, Issue #45

12. Grounding note

The article's use of "confabulates" is worth noting. Widdows and Cohen discuss this terminology in Ch. 6.1.1, citing cognitive scientist Christopher Summerfield's observation that LLM behavior "is closer to what in humans is called confabulation, a much less alarming term than hallucination." They argue that generative models produce statistically plausible text, not recalled facts, making confabulation the more accurate description. This is a useful tangent for the article's framing of what happens when retrieval fails. Widdows & Cohen, Issue #45

13. Grounding note

SLP3 §11.2 formalizes this tradeoff curve as the interpolated precision function, where precision is measured at eleven standard recall levels (0.0, 0.1, ... 1.0). The resulting precision-recall curve makes the tradeoff visually explicit: a system that returns more documents pushes toward the high-recall end of the curve, but precision drops along the way. Mean Average Precision (MAP), computed as the mean of average precision across queries, compresses this entire curve into a single number for system comparison. SLP3 §11.2
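To make these metrics concrete, here is a rough sketch of eleven-point interpolated precision and MAP. It assumes binary relevance judgments and divides average precision by the total number of relevant documents, which is one common convention and may differ in detail from the textbook's definition. The function names and sample rankings are illustrative only.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def interpolated_precision(ranking, relevant, levels=11):
    """Best precision achievable at or beyond each standard recall level."""
    if not relevant:
        return [0.0] * levels
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Small tolerance guards against floating-point rounding at the boundary.
        reachable = [p for rec, p in points if rec >= level - 1e-12]
        curve.append(max(reachable, default=0.0))
    return curve


# Two hypothetical queries.
ap = average_precision(["a", "x", "b"], {"a", "b"})
curve = interpolated_precision(["a", "x", "b"], {"a", "b"})
map_score = mean_average_precision([(["a", "x", "b"], {"a", "b"}), (["x", "a"], {"a"})])
```

For the first query, precision is 1 at the first relevant hit and 2/3 at the second, so the interpolated curve stays at 1 up to recall 0.5 and drops to 2/3 beyond it.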

14. Grounding note

The textbook positions RAG as improving accuracy by having LLMs summarize retrieved documents rather than generating from memory, enabling citations. The key insight: RAG doesn't fix the model; it gives the model better inputs to work with. See GH #3, Ch. 5.

Dense Retrieval: Searching by Meaning

15. Grounding note

Widdows and Cohen provide deep historical context for this technique in Ch. 2.2. Vector-based search originated in the 1960s, when IR researchers moved beyond Boolean logic to use term-document matrices and cosine similarity for graded relevance rankings. They show how comparing query vectors with document vectors using cosine similarity "became a terrific abstract tool." The modern dense retrieval described in this article is a direct descendant of these methods, with neural embeddings replacing the original sparse term-count vectors. Widdows & Cohen, Issue #45

16. Grounding note

The theoretical foundation for why this works is the distributional hypothesis, formalized in SLP3 §5.2: "words that occur in similar contexts tend to have similar meanings." Jurafsky and Martin trace this idea to J.R. Firth's 1957 dictum, "You shall know a word by the company it keeps." Dense retrieval is, at bottom, an application of this principle at the passage level. Two passages that discuss similar concepts in similar linguistic contexts will produce similar vectors, even if they share no vocabulary. The distributional hypothesis is the reason embeddings work at all. SLP3 §5.2

17. Grounding note

Jurafsky and Martin cite Furnas et al. (1987) in SLP3 §11.3, who demonstrated that the probability of two people choosing the same term for a concept is less than 20%. This vocabulary mismatch problem was the original motivation for moving beyond keyword-based retrieval. Dense retrieval addresses this by learning representations where semantically similar texts cluster together, but as this article notes, the compression introduces its own form of information loss. The mismatch problem is not eliminated; it is transformed. SLP3 §11.3

18. Grounding note

Widdows and Cohen illustrate exactly this problem in Ch. 3.4 with a vivid example. They show that a RAG system queried about the Alfa Romeo car brand could be led astray by documents about Shakespeare's Romeo and Juliet, because a single global embedding vector conflates all contextual meanings of "Romeo." They call this the "problem of synonymy" and note it "can happen when a single global embedding vector is used to represent all of the contextual meanings of a word." This directly parallels the article's "vocabulary erasure" concept. Widdows & Cohen, Issue #45

19. Grounding note

Alammar & Grootendorst explicitly note these dense retrieval caveats: semantic search may return irrelevant results and struggles with exact phrase matching. They recommend hybrid search combining dense retrieval with keyword-based methods to compensate. See GH #5, Ch. 8.

20. Grounding note

Widdows and Cohen discuss the dimensionality tradeoff extensively in Ch. 2.4, in the context of Latent Semantic Analysis (LSA). LSA used principal component analysis to "reduce the dimensionality of a dataset as much as possible, while preserving as much of the spread of data coordinates as possible along each direction." They also note that in high dimensions, "a random choice of axes for the lower-dimensional representation does surprisingly well," achieving much of the performance at a fraction of the cost. This is useful context for understanding why smaller embedding models can be competitive. Widdows & Cohen, Issue #45

21. Grounding note

SLP3 §5.5 draws a critical distinction between static embeddings (word2vec, GloVe) where each word gets a single vector regardless of context, and contextual embeddings (BERT, GPT) where the representation changes based on surrounding text. For domain-specific RAG, this distinction matters: static embeddings trained on general text may assign the same vector to "bank" whether it refers to a financial institution or a river bank. Contextual models resolve this ambiguity, which is part of why transformer-based bi-encoders outperform earlier approaches for retrieval. SLP3 §5.5

Sparse Retrieval: The Persistence of Keywords

22. Grounding note

Widdows and Cohen provide rich historical context for this lineage in Ch. 2.3.1. They credit Karen Spärck Jones's seminal work in the 1970s for the insight that "rarer terms are more specific, and thus more relevant to search engines." They describe BM25 as an extension of tf-idf that "adjusts for document length," and note that Spärck Jones and Stephen Robertson were key collaborators. The book also makes an elegant analogy: just as Galileo simplified physics by recognizing that gravitational attraction depends on mass, "statistical term weighting demonstrated a similarly simple relationship between scarcity and relevance." Widdows & Cohen, Issue #45

23. Grounding note

SLP3 §11.1 gives the full BM25 scoring formula and explains how it extends tf-idf. The key parameters are k (controlling term frequency saturation, typically 1.2-2.0) and b (controlling document length normalization, typically 0.75). When b = 1, long documents are penalized heavily; when b = 0, document length is ignored entirely. Jurafsky and Martin also note that tf-idf itself builds on a deeper insight: the tf component captures how central a term is to a document, while the idf component, defined as log(N/df_t), captures how discriminative a term is across the collection. SLP3 §11.1
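The roles of k and b are easier to see in code. The sketch below uses one common form of the BM25 score, idf × tf / (k(1 − b + b·|d|/avgdl) + tf), with idf = log(N/df_t) as in the note. Whitespace tokenization, the natural-log base, and the sample documents are simplifying assumptions, and production implementations often use a smoothed idf instead.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k=1.2, b=0.75):
    """Score every document in `docs` against `query` with a basic BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # b = 0 ignores document length; b = 1 applies full length normalization.
        length_norm = k * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(n_docs / df[term])
            score += idf * tf[term] / (length_norm + tf[term])
        scores.append(score)
    return scores


# Hypothetical three-document collection.
docs = [
    "the cat sat on the mat",
    "dogs chase the cat across the yard every single day",
    "stock markets fell",
]
with_norm = bm25_scores("cat", docs)        # default b = 0.75
without_norm = bm25_scores("cat", docs, b=0.0)
```

Both of the first two documents mention "cat" once. With b = 0.75 the shorter document scores higher; with b = 0 the two scores are identical because length no longer enters the formula.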

24. Grounding note

Widdows and Cohen illustrate this exact limitation in Ch. 2.4, where they note that early term vector models had no mechanism to connect the query "walking group" with a document containing "ambling society." They call this the "symbol grounding problem," citing Stevan Harnad. The development of neural word embeddings (Ch. 3.3) was in part a response to this limitation, producing vectors where "words with similar roles or meanings tend to occur in similar contexts," providing "a natural solution to what is sometimes referred to as the problem of synonymy." Widdows & Cohen, Issue #45

Hybrid Search: Combining What Works

25. Grounding note

SLP3 §11.3 notes that FAISS (Facebook AI Similarity Search) is the standard tool for approximate nearest neighbor search over dense vectors at scale. Jurafsky and Martin also explain in §5.4 why cosine similarity, not raw dot product, is the standard metric: cosine normalizes by vector length, so it measures directional similarity rather than magnitude. A document embedding with large magnitude would dominate raw dot product comparisons regardless of relevance. The article's code uses np.dot, which is equivalent to cosine similarity only when the vectors are already normalized to unit length; in production FAISS configurations the distinction matters. SLP3 §11.3, §5.4
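A small worked example shows how magnitude can distort a raw dot product. This is a pure-Python sketch with made-up two-dimensional vectors, assuming no vector is all zeros; it is not the article's code and does not use FAISS.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Dot product divided by both lengths: direction only, not magnitude."""
    return dot(u, v) / (norm(u) * norm(v))


def normalize(u):
    """Scale a non-zero vector to unit length."""
    length = norm(u)
    return [a / length for a in u]


# Hypothetical embeddings: one closely aligned with the query, one long but off-direction.
query = [1.0, 0.0]
aligned = [0.9, 0.1]
long_off_topic = [5.0, 5.0]

dot_ranking = (dot(query, aligned), dot(query, long_off_topic))
cos_ranking = (cosine(query, aligned), cosine(query, long_off_topic))
unit_dot = dot(normalize(query), normalize(aligned))
```

The raw dot product prefers the long vector (5.0 against 0.9), while cosine prefers the aligned one (about 0.99 against about 0.71). Once both vectors are normalized, the dot product and cosine agree.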

Reranking: The Second Pass

26. Grounding note

SLP3 §11.3 describes the bi-encoder architecture in detail: query and document are encoded separately by two independent transformer encoders, and relevance is computed as the dot product of the resulting vectors. This independence is what makes bi-encoders fast (documents can be pre-encoded) but also what limits them. Cross-encoders, by contrast, concatenate query and passage into a single input, allowing full attention across all token pairs. Jurafsky and Martin also discuss ColBERT as a middle ground: it uses a bi-encoder to produce per-token vectors, then computes relevance via MaxSim operators, capturing finer-grained interactions without the full cost of a cross-encoder. SLP3 §11.3
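The difference between single-vector scoring and MaxSim can be sketched without any model. The example below assumes the per-token vectors have already been produced by some encoder; the tiny two-dimensional vectors are invented for illustration, and real ColBERT-style systems typically normalize token vectors and work in far higher dimensions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bi_encoder_score(query_vec, doc_vec):
    """One vector per text: relevance is a single dot product."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token takes its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical per-token vectors (one inner list per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
```

Document A matches both query tokens and scores 2.0; document B matches only one and scores 1.0. The document-side token vectors can still be precomputed, which is why this sits between a bi-encoder and a cross-encoder in cost.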

27. Grounding note

Widdows and Cohen explain why this joint processing is so powerful in Ch. 4.3, where they describe how attention heads act as learned projection operators that highlight relationships between token pairs. They note that "attention is guided by executive control systems" that manage "an interaction weight budget, so that the most important relationships come to the fore." This is the mechanism that makes cross-encoders superior to bi-encoders: the attention layers can directly model whether a document answers a query, not just whether they share a topic. Widdows & Cohen, Issue #45

28. Grounding note

Alammar & Grootendorst report that rerankers dramatically improve search quality, with nDCG jumping from 36.5 to 62.8 using cross-encoder scoring. This aligns with the two-stage pattern described here: cheap first-stage retrieval followed by expensive but precise reranking. See GH #5, Ch. 8.

Query Transformation: Making the Question Better

29. Grounding note

SLP3 §11.1 explains the data structure that makes query expansion practical for sparse retrieval: the inverted index, which maps each term in the vocabulary to the list of documents containing it. Query expansion adds more terms to the lookup, and the inverted index returns their document lists in constant time per term. Without this structure, expanding a query from 3 terms to 8 would be computationally expensive. With it, the cost is negligible. This is the infrastructure that has made keyword search scale to billions of documents since the 1990s. SLP3 §11.1
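A toy inverted index makes the cost argument visible. This is a minimal sketch using a Python dictionary of sets, with whitespace tokenization and invented documents; real engines store sorted, compressed postings lists, but the lookup pattern is the same.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, terms):
    """Union of postings: one dictionary lookup per query term."""
    matches = set()
    for term in terms:
        matches |= index.get(term, set())
    return matches


# Hypothetical help-center snippets.
docs = {
    1: "reset your password",
    2: "recover account credentials",
    3: "billing invoice history",
}
index = build_inverted_index(docs)

original = search(index, ["password"])
expanded = search(index, ["password", "credentials", "recover"])
```

The original query finds only document 1. Expanding it with related terms also finds document 2, at the cost of two extra dictionary lookups rather than a second scan of the collection.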

Evaluation: Measuring What Matters

30. Grounding note

Widdows and Cohen describe the origin of this evaluation methodology in Ch. 2.3.3. The Text REtrieval Conference (TREC), first held in 1992, formalized the practice of shared evaluation tasks with agreed-upon datasets, queries, relevance judgments, and metrics. They argue that TREC-style evaluation "made it harder to prefer complicated language processing systems over simple ones" by requiring demonstrated results over assumptions. The article's emphasis on rigorous evaluation follows this same tradition. Widdows & Cohen, Issue #45

31. Grounding note

SLP3 §11.6 addresses a subtlety the article's metrics do not cover: how to evaluate the answer produced after retrieval, not just the retrieval itself. For multiple-choice QA, Jurafsky and Martin recommend exact match accuracy. For free-text answers, they use token-level F1, computing precision and recall over the tokens in the predicted answer versus the gold answer. This matters because a RAG system can retrieve the right document but still generate a poor answer, and the retrieval metrics alone will not catch that. Separating retrieval evaluation from answer evaluation, as the article recommends, aligns with SLP3's two-layer approach. SLP3 §11.6
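Both answer-level metrics are short enough to sketch. The version below assumes lowercasing and whitespace tokenization; published evaluation scripts usually also strip punctuation and articles, so treat this as an approximation. The example answers are invented.

```python
from collections import Counter


def exact_match(predicted, gold):
    """True when the answers are identical after simple normalization."""
    return predicted.strip().lower() == gold.strip().lower()


def token_f1(predicted, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predicted and gold answers.
f1 = token_f1("the eiffel tower in paris", "eiffel tower")
em = exact_match("Eiffel Tower", "eiffel tower")
```

The prediction contains both gold tokens among its five tokens, so precision is 2/5, recall is 1, and F1 is 4/7 (about 0.57). Exact match would reject the longer prediction outright, which is why token-level F1 suits free-text answers better.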

32. Grounding note

SLP3 §11.5 catalogs the standard benchmark datasets that serve as common evaluation baselines: Natural Questions (real Google queries with Wikipedia answers), MS MARCO (Bing queries with human-written answers), MMLU (multiple-choice across 57 academic subjects), and TyDi QA (multilingual queries across 11 languages). These provide useful reference points even if your production evaluation set is domain-specific, because they let you compare your retriever's performance against published baselines before tuning for your own data. SLP3 §11.5

The Chunk Size Question

33. Grounding note

Widdows and Cohen note in Ch. 2.3.1 that the granularity question is not new: they observe that "deciding whether documents should be indexed at the book, chapter, paragraph, or verse level is a design decision that also affects document lengths," which directly influences BM25's length normalization. The chunk size question in modern RAG is essentially the same design decision that IR researchers have faced for decades. Widdows & Cohen, Issue #45