Sources

Grounding, citations, and further reading for What Breaks.

All of this is optional: nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific failure mode knows where to look.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 7 on large language models, Chapter 10 on contextual embeddings, and Chapter 11 on retrieval and question answering are the primary sources for the grounding notes on this page.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Accessible and mathematically grounded survey of LLM architecture and behavior. Chapters 5 and 6 cover prompt engineering, RAG caveats, agentic reliability, and the historical evolution of context windows. Particularly strong on distinguishing plausibility from correctness, which is the throughline of this article.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey that treats LLMs as deployed systems rather than research artifacts. Chapter 3 on autoregressive generation and Chapter 8 on RAG supply the grounding for the context-window and retrieval failure modes discussed in the article.

Key RAG and Evaluation Papers

Barnett et al. (2024); Shankar et al. (2024); Chen, Zaharia, & Zou (2023).

Three open-access arXiv papers that map directly to the failure modes in this article: Barnett et al. taxonomize seven distinct RAG failure points, Shankar et al. document the gap between LLM-assisted evaluation and human preferences, and Chen et al. measure empirical drift in ChatGPT behavior across model updates.

Why LLM Failures Are Different

4. Renze & Guven on sampling temperature and reliability

Renze and Guven run a controlled study of how sampling temperature affects LLM problem-solving performance. Their finding is that higher temperatures introduce more variation in outputs without reliably improving task accuracy, which makes them a direct empirical source for the claim that the same input can yield a correct answer 95 times and a wrong answer 5 times. The paper is useful grounding for any discussion of why LLM failures are probabilistic rather than deterministic.

Renze, Matthew, and Erhan Guven. "The Effect of Sampling Temperature on Problem Solving in Large Language Models." arXiv preprint arXiv:2402.05201, 2024.

8. Training objective as the root of LLM failure

Farris et al. identify a root cause for many failures: LLMs are trained to mimic human text, not to be accurate. The loss function rewards plausible next-token prediction, even when the training data contains errors. Novel tasks cause failures because LLMs extrapolate poorly when a task differs significantly from the training data.

See GH #3, Farris et al., Ch. 4.

9. Cross-entropy loss and the plausibility objective

Jurafsky and Martin formalize the root cause of LLM failure in SLP3 §7.5. During training, the model learns to predict the next token by minimizing cross-entropy loss over a massive corpus. The loss function rewards generating tokens that are probable given the training distribution, not tokens that are correct by any external standard. This is the foundational reason why LLM failures look so different from traditional software bugs: the system is optimizing for distributional plausibility, and "correct" is not a variable in the objective function.

Jurafsky & Martin, SLP3 §7.5.
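
To make the objective concrete, here is a minimal sketch of per-token cross-entropy. The toy distribution and its numbers are invented for illustration and are not taken from SLP3; the point is only that the loss depends on the probability assigned to whatever token the corpus contains, with no term for factual correctness.

```python
import math

def cross_entropy(predicted_probs, observed_tokens):
    """Average negative log-probability the model assigned to each observed token."""
    total = 0.0
    for probs, token in zip(predicted_probs, observed_tokens):
        total += -math.log(probs[token])
    return total / len(observed_tokens)

# Hypothetical next-token distribution after "The capital of Australia is"
# (made-up numbers for illustration only).
probs = {"Sydney": 0.6, "Canberra": 0.3, "Melbourne": 0.1}

# The loss is lower when the corpus token matches what the model finds probable,
# regardless of which continuation is factually right.
loss_if_corpus_says_sydney = cross_entropy([probs], ["Sydney"])
loss_if_corpus_says_canberra = cross_entropy([probs], ["Canberra"])
```

In this toy case a corpus that repeats the common error is "easier" for the model than one that states the fact, which is the asymmetry the note describes.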

10. Plausibility versus verbatim reproduction

Widdows and Cohen provide useful context here. In Ch. 5, they explain that generation works by sampling from a learned probability distribution, with a temperature parameter controlling how likely each token is to be selected. In Ch. 6.1.1, they emphasize that "language models are designed to generate text that is plausible ... rather than to reproduce text from this corpus verbatim." This is the root mechanic behind the stochastic failures described above: the model optimizes for plausibility, not correctness, so failures present as fluent-but-wrong outputs.

Widdows & Cohen, Ch. 5, Ch. 6.1.1. Issue #45

11. Temperature and the decoding-step origin of variance

Jurafsky and Martin explain the stochastic mechanism in SLP3 §7.4. Generation works by sampling from a probability distribution over the vocabulary at each step. The temperature parameter controls this distribution: at temperature 0 (greedy decoding), the model always picks the highest-probability token; at higher temperatures, lower-probability tokens become viable alternatives. This is why the "same input, different outputs" phenomenon exists. With temperature above zero, the model may produce a correct answer 95 times and a subtly wrong one on the 96th, because a low-probability but plausible token was sampled at a critical decision point. The failure is baked into the generation algorithm itself.

Jurafsky & Martin, SLP3 §7.4.
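
The mechanism can be sketched in a few lines. This is an illustrative toy, not code from SLP3: the logits are hypothetical, and index 0 stands in for the "correct" token at a critical decision point.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Greedy pick at temperature 0, otherwise sample from the tempered distribution."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [4.0, 1.0, 0.5]  # hypothetical scores; index 0 is the "correct" token
rng = random.Random(0)    # seeded only so the sketch is repeatable

greedy = [sample_token(logits, 0, rng) for _ in range(1000)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(1000)]
wrong_rate = sum(1 for t in sampled if t != 0) / len(sampled)
```

With greedy decoding every run picks index 0. At temperature 1.0 the same logits occasionally yield one of the other tokens, so `wrong_rate` is small but nonzero, which is the "correct most of the time" behavior described above.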

Failure Type 1: Prompt Drift

1. Anthropic on model card evaluation and version behavior

Anthropic's model card documentation describes the behavioral profile of each Claude release, including differences in refusal behavior, instruction following, and sensitivity to system-prompt wording. The document is the canonical primary source for the claim that Claude 3.5 interprets system prompts differently from Claude 3. Any team pinning a model version should read the model card for the next version before migrating.

Anthropic. "Model Card and Evaluations for Claude Models." Anthropic Research, 2024.

2. Chen, Zaharia, & Zou on empirical behavior drift

Chen, Zaharia, and Zou measured how GPT-4 and GPT-3.5 performance shifted on standardized tasks between March and June of 2023. Their finding is that model behavior can change materially across dates even when the published model name does not. This is the empirical backbone for treating model upgrades as deployment events: the same prompt routed to "the same" model can produce different outputs weeks apart, and production systems that assume stability will drift without warning.

Chen, Lingjiao, Matei Zaharia, and James Zou. "How is ChatGPT's Behavior Changing over Time?" arXiv preprint arXiv:2307.09009, 2023.
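
One way to act on "model upgrades are deployment events" is to replay a pinned set of prompts against both versions before switching. The sketch below is an assumption-laden illustration, not a method from the paper: `march_snapshot` and `june_snapshot` are hypothetical stubs standing in for real model calls, and the equivalence check is deliberately crude.

```python
def regression_report(prompts, old_model, new_model, equivalent):
    """Run each pinned prompt through both model versions and list behavioral changes."""
    changed = []
    for prompt in prompts:
        before, after = old_model(prompt), new_model(prompt)
        if not equivalent(before, after):
            changed.append({"prompt": prompt, "before": before, "after": after})
    rate = len(changed) / len(prompts) if prompts else 0.0
    return {"total": len(prompts), "changed": changed, "change_rate": rate}

# Hypothetical stand-ins for two dated snapshots behind the same model name.
def march_snapshot(prompt):
    return "yes" if "prime" in prompt else "no"

def june_snapshot(prompt):
    return "no"

golden_prompts = ["Is 17077 a prime number?", "Is 8 an odd number?"]
report = regression_report(
    golden_prompts,
    march_snapshot,
    june_snapshot,
    equivalent=lambda a, b: a.strip().lower() == b.strip().lower(),
)
```

A nonzero `change_rate` on the golden set is a signal to review the changed cases before routing production traffic to the newer snapshot.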

7. Khattab et al. on programmatic prompt management (DSPy)

DSPy reframes prompt engineering as a compilation problem: declare what the model should do, and let the framework optimize the prompt against a metric. The relevance to prompt drift is that the prompt becomes a compiled artifact rather than a hand-tuned string, so model upgrades can trigger recompilation rather than silent degradation. The paper is the best current reference for teams building systematic defenses against the drift described in Failure Type 1.

Khattab, Omar, et al. "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." arXiv preprint arXiv:2310.03714, 2023.

12. Prefix sensitivity and the mechanics of prompt drift

Widdows and Cohen note in Ch. 5.2.3 that "it's remarkable how little additional training data are required to transform a language generating model into an instruction following one." They also describe prefix-tuning (Ch. 5.3.4), where a prefix vector concatenated to each input at each layer adapts model behavior, inspired by natural language prompt-prefixes. This explains why prompt drift is so dangerous: the relationship between prompt wording and model behavior is highly sensitive, and small changes in model weights after an update can shift how the same prefix is interpreted.

Widdows & Cohen, Ch. 5.2.3, Ch. 5.3.4. Issue #45

13. System prompts as calibrated artifacts

Jurafsky and Martin define the system prompt in SLP3 §7.3 as "a kind of meta-prompt that tells the LLM how to behave when responding to user prompts." They also describe prompt engineering as the practice of "designing prompts that are effective for a particular task." What the prompt drift scenario reveals is that prompt engineering is not a one-time activity. It is an ongoing calibration problem. The system prompt is designed against a specific model's interpretation of language. When the model changes, the interpretation changes, and a prompt that was "engineered" for one model becomes unengineered for its successor.

Jurafsky & Martin, SLP3 §7.3.

Failure Type 2: Retrieval Poisoning

3. Barnett et al. on seven RAG failure points

Barnett and colleagues catalog seven distinct points at which a RAG pipeline can fail, including chunking errors, missing content, incorrect context, and incorrect specificity. The paper is the best single reference for the claim that RAG introduces a new class of failure modes beyond the LLM itself, and chunking errors are explicitly called out as one of the most common and most dangerous. The legal-citation failure described in this article is a textbook example of their "chunking error" category.

Barnett, Scott, et al. "Seven Failure Points When Engineering a Retrieval Augmented Generation System." arXiv preprint arXiv:2401.05856, 2024.

14. RAG caveats and the amplification of retrieval errors

Alammar & Grootendorst warn about RAG caveats: retrieval may return irrelevant results, and the system struggles with exact matching. Retrieval poisoning is the extreme case of these caveats. When the retrieval layer returns corrupted or fragmented content, the model reconstructs plausible but fictional outputs, compounding the retrieval failure with generative hallucination.

See GH #5, Alammar & Grootendorst, Ch. 8.

15. RAG as a generator conditioned on retrieved passages

Jurafsky and Martin describe the RAG architecture formally in SLP3 §11.4. The generator receives the query and retrieved passages as context, then produces an answer conditioned on both. They note that RAG systems can include "knowledge citations" where the model attributes its answer to specific retrieved passages. The legal citation fabrication case illustrates a failure mode that the formal RAG architecture does not address: when the retrieved passages themselves are corrupted (split citations), the generator produces outputs that faithfully reflect the corruption. The architecture assumes the retriever returns coherent passages. This assumption is load-bearing, and chunk quality validation is the engineering work required to uphold it.

Jurafsky & Martin, SLP3 §11.4.
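
A minimal sketch of what chunk quality validation could look like for the split-citation case. Everything here is an assumption for illustration: the citation pattern is a simplified regex for "volume U.S. page" references, and the fixed-size chunker is a stand-in for whatever splitter a real pipeline uses.

```python
import re

# Simplified pattern for citations such as "410 U.S. 113" (an assumption;
# real citation formats are far more varied).
CITATION = re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b")

def fixed_size_chunks(text, size):
    """Naive chunker: cut every `size` characters, ignoring content boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def split_citations(text, size):
    """Return citations in `text` that straddle a fixed-size chunk boundary."""
    broken = []
    for match in CITATION.finditer(text):
        first_chunk = match.start() // size
        last_chunk = (match.end() - 1) // size
        if first_chunk != last_chunk:
            broken.append(match.group())
    return broken

text = "The court relied on 347 U.S. 483 and later cited 410 U.S. 113 in its ruling."

broken_at_55 = split_citations(text, 55)  # a boundary falls inside the second citation
broken_at_40 = split_citations(text, 40)  # both citations land inside single chunks
```

A check like this runs at indexing time, before any fragment reaches the generator, so a split citation is caught as a chunking defect rather than surfacing later as a fabricated reference.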

16. Phantom references and fabricated citations

Widdows and Cohen directly address this failure pattern. In Ch. 5.3.3, they explain that RAG "is easily misinterpreted" and warn that while retrieval "helps RAG to produce more factual answers," it "doesn't constrain it to produce only sentences that are equally authoritative." They also note in Ch. 6 that "a 'phantom reference' to an imaginary article was copied verbatim" in academic literature long before LLMs, and that AI-fabricated references have appeared in official government reports. The legal citation fabrication described here is the extreme case of this pattern.

Widdows & Cohen, Ch. 5.3.3, Ch. 6. Issue #45

Failure Type 3: Tool Cascade Failures

17. Agent reliability versus agent capability

Widdows and Cohen cite CMU's TheAgentStudy (Ch. 6), which "showed that AI agents are still deeply unreliable when it comes to carrying out tasks responsibly." They note that even though LLMs can pass elite university entrance exams, "agent-based studies demonstrate that completing even basic office tasks reliably is so hard to automate." This is exactly the gap between capability and reliability that tool cascade failures exploit: the model is capable of booking a hotel, but not reliable enough to handle the ambiguity of a timeout.

Widdows & Cohen, Ch. 6. Issue #45

Failure Type 4: Context Window Overflow

18. Autoregressive compounding and silent truncation

Alammar & Grootendorst explain that autoregressive generation means each token depends on previous predictions, and errors compound through the sequence. Context window overflow amplifies this: when the system prompt is silently truncated, the model generates tokens conditioned on incomplete context, and each subsequent token propagates that incomplete understanding further.

See GH #5, Alammar & Grootendorst, Ch. 3.
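
One hedged sketch of a guard against silent truncation: reserve the system prompt's budget first, drop the oldest conversation turns before anything else, and fail loudly if the system prompt alone does not fit. The whitespace tokenizer and the `ContextOverflowError` name are assumptions made for this example; a real system would count tokens with the model's own tokenizer.

```python
class ContextOverflowError(Exception):
    """Raised when a request cannot fit in the context window."""

def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace split, not a real model tokenizer.
    return len(text.split())

def build_context(system_prompt, history, max_tokens):
    """Keep the system prompt intact and drop the oldest history turns first.

    Returns the assembled context and the number of turns that were dropped,
    so the caller can log the truncation instead of discovering it later.
    """
    budget = max_tokens - count_tokens(system_prompt)
    if budget < 0:
        raise ContextOverflowError("system prompt alone exceeds the context window")
    kept = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    dropped = len(history) - len(kept)
    return [system_prompt] + list(reversed(kept)), dropped

context, dropped = build_context(
    "You are a careful assistant", ["a b c", "d e", "f g h i"], max_tokens=12
)
```

Returning the dropped count is the important design choice here: truncation still happens, but it becomes an observable event rather than a silent one.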

19. Historical evolution of context-window limits

Widdows and Cohen trace the evolution of context/sequence lengths in Table 5.1 (Ch. 5): BERT supported 512 tokens, GPT-3 rose to 2,048, and modern models reach 128k+. They also explain in Ch. 4.2.1 that the earliest neural language model (Bengio et al., 2000) took a fixed-length window of preceding words as its context, concatenating their vectors as input. The fundamental constraint has always been the same: there is a finite window, and what falls outside it is invisible. Silent truncation is the modern version of this long-standing architectural limitation.

Widdows & Cohen, Ch. 4.2.1, Ch. 5 (Table 5.1). Issue #45

Failure Type 5: Evaluation Blind Spots

5. Shankar et al. on validator-human alignment

Shankar and colleagues study the "who validates the validators" problem: LLM-assisted evaluation pipelines are themselves LLMs, and their judgments drift from human preferences in systematic ways. Their empirical results document exactly the gap described in Failure Type 5, where automated metrics report high accuracy while user satisfaction lags. The paper also discusses techniques for realigning evaluation with human judgment, which informs the article's proposed fix.

Shankar, Shreya, et al. "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." arXiv preprint arXiv:2404.12272, 2024.

20. Perplexity, MMLU, and the contamination problem

Jurafsky and Martin discuss the evaluation gap in SLP3 §7.6. They describe perplexity as the standard intrinsic metric, noting that it measures "how well the model predicts held-out data," and warn that perplexity improvements do not always correlate with downstream task performance. For extrinsic evaluation, they discuss benchmarks like MMLU, then immediately flag the data contamination problem: "if test data has been included in the training data, the measured accuracy will be inflated." The evaluation blind spot this article describes is a third failure mode beyond perplexity limitations and data contamination. It is a dimension mismatch: the benchmark measures accuracy, but users care about twelve dimensions, of which accuracy is only one.

Jurafsky & Martin, SLP3 §7.6.

21. Cranfield, TREC, and the maker's bias

Widdows and Cohen provide deep historical context for this problem. In Ch. 2.3.3, they trace evaluation methodology from the Cranfield experiments of the 1960s through to TREC and GLUE benchmarks, and caution that "getting good results at one challenge doesn't always mean a system will adapt reliably to new tasks." They also note a "natural maker's bias" (Ch. 1): "as machine learning engineers, we want our models to be valid and valuable. This can make us eager to believe that the world is more like the simple situation on the left, rather than the muddle on the right." This is precisely the dynamic behind evaluation blind spots: optimizing for the measurable dimension while ignoring the messy reality of user experience.

Widdows & Cohen, Ch. 1, Ch. 2.3.3. Issue #45

22. Precision, recall, and multi-faceted IR evaluation

Jurafsky and Martin trace the roots of this multi-dimensional evaluation challenge to information retrieval in SLP3 §11.2. They describe precision, recall, and Mean Average Precision (MAP) as complementary metrics that each capture a different aspect of system quality. Precision alone is insufficient because a system that returns one perfect result has 100% precision but is useless for most tasks. The lesson from IR evaluation, established over decades, is that no single metric suffices. The team's expansion to tone, format, and task completion mirrors the IR field's evolution from raw precision to multi-faceted evaluation.

Jurafsky & Martin, SLP3 §11.2.
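
The "one perfect result" example is easy to verify with a short sketch. The document IDs are hypothetical, retrieved lists are assumed to contain no duplicates, and the average-precision function uses one common normalization (dividing by the total number of relevant documents), which may differ in detail from the textbook's definition.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision-at-rank over the ranks where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

relevant = {"d1", "d2", "d3", "d4"}  # hypothetical ground truth

one_hit = ["d1"]  # a single perfect result
p_one, r_one = precision(one_hit, relevant), recall(one_hit, relevant)

ranked = ["d1", "x", "d2", "d3"]  # a longer, imperfect ranking
ap = average_precision(ranked, relevant)
```

The single-result system scores a precision of 1.0 with a recall of 0.25, which is why each metric alone gives a misleading picture. Mean Average Precision is the mean of `average_precision` over a set of queries.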

Failure Type 6: The Silent Degradation

6. Paleyes et al. on deployment failure patterns

Paleyes, Urma, and Lawrence survey the ML-deployment literature and organize the case studies into recurring failure patterns: data drift, concept drift, silent pipeline breakage, and monitoring gaps. Their framing predates the current LLM era but maps cleanly onto it, and their discussion of data drift is the best academic reference for the silent degradation failure mode. The survey is particularly useful because it is agnostic to model type, treating the symptoms at the pipeline level rather than the architecture level.

Paleyes, Andrei, Raoul-Gabriel Urma, and Neil D. Lawrence. "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys 55(6), 2022.

23. Overfitting patterns that appear in production

Raschka shows how to detect overfitting: when validation loss diverges from training loss (validation stays high while training decreases), the model is memorizing rather than generalizing. This same pattern, observable in production metrics, is the silent degradation this article describes.

See GH #4, Raschka, Ch. 5.
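
The divergence pattern can be checked mechanically. The following is a minimal sketch, not Raschka's code: the loss values are invented, and the `patience` threshold is an assumption chosen for the example.

```python
def divergence_epoch(train_loss, val_loss, patience=2):
    """Return the first epoch index of a run of `patience` consecutive epochs in
    which training loss fell while validation loss did not, or None if no such
    run exists."""
    streak = 0
    for epoch in range(1, len(train_loss)):
        train_improved = train_loss[epoch] < train_loss[epoch - 1]
        val_stalled = val_loss[epoch] >= val_loss[epoch - 1]
        streak = streak + 1 if (train_improved and val_stalled) else 0
        if streak >= patience:
            return epoch - patience + 1
    return None

# Made-up loss curves for illustration.
train = [2.0, 1.5, 1.1, 0.8, 0.6, 0.4]
val_overfit = [2.1, 1.7, 1.5, 1.5, 1.6, 1.8]   # stalls, then rises
val_healthy = [2.1, 1.7, 1.4, 1.1, 0.9, 0.7]   # keeps falling

overfit_start = divergence_epoch(train, val_overfit)
healthy_start = divergence_epoch(train, val_healthy)
```

The same comparison can be applied to any pair of "optimized" and "held-out" metrics tracked over time, which is how the pattern carries over from training curves to production monitoring.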

24. Sparse versus dense retrieval and interpretability

Jurafsky and Martin describe two retrieval paradigms in SLP3 §11.1 and §11.3 that illuminate why this degradation was so hard to detect. Classical IR (§11.1) uses sparse representations like tf-idf and BM25, where term weights are transparent and interpretable. Dense retrieval (§11.3) uses learned embeddings from bi-encoders, where similarity scores are opaque. With BM25, a format change in date fields would be immediately visible in the term statistics. With dense retrieval, the same change manifests as a subtle shift in embedding space that is only detectable through distributional analysis. The move from sparse to dense retrieval traded interpretability for performance, and the cost of that trade is exactly the kind of silent drift described here.

Jurafsky & Martin, SLP3 §11.1, §11.3.
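
To illustrate why a sparse index makes a date-format change visible, here is a small sketch that compares vocabulary between two snapshots of a corpus. It is a simplification under stated assumptions: the documents are invented, the tokenizer is a basic regex, and it inspects raw document frequencies rather than full tf-idf or BM25 weights.

```python
import re
from collections import Counter

def term_document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(re.findall(r"[a-z0-9/-]+", doc.lower())))
    return df

def vanished_and_new_terms(before_docs, after_docs):
    """Terms that disappeared from, or newly appeared in, the index vocabulary."""
    before = term_document_frequency(before_docs)
    after = term_document_frequency(after_docs)
    vanished = sorted(set(before) - set(after))
    new = sorted(set(after) - set(before))
    return vanished, new

# Hypothetical snapshots: the upstream feed switches from ISO 8601 dates
# to a locale-specific format.
before_docs = ["filed 2024-03-05", "filed 2024-04-01"]
after_docs = ["filed 05/03/2024", "filed 01/04/2024"]

vanished, new = vanished_and_new_terms(before_docs, after_docs)
```

Every ISO-formatted date token vanishes and a matching set of slash-formatted tokens appears, so the change shows up as an explicit vocabulary diff. A dense index has no equivalent listing to inspect, which is why the shift there has to be inferred from distributional analysis.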

25. Cosine similarity and the meaning of small shifts

The cosine similarity metric used to detect this drift is explained from first principles in Widdows and Cohen, Ch. 2. They show that a cosine similarity of 0.75 "indicates that Doc1 and Doc2 have a great deal in common," and that "in higher dimensions, randomly chosen pairs of vectors tend to have cosine similarity closer to zero." This means a drop from 0.92 to 0.78 is a significant shift in high-dimensional embedding space, even though it looks like a small numeric change. The book's mathematical grounding helps explain why the KS-test drift detection in the article's code is the right tool for the job.

Widdows & Cohen, Ch. 2. Issue #45
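
Both claims can be checked with a standard-library sketch. This is not the article's drift-detection code: the similarity scores are synthetic, the 0.02 spread is an arbitrary assumption, and the hand-rolled KS statistic omits the p-value that a statistics library (for example `scipy.stats.ks_2samp`) would also report.

```python
import math
import random
from bisect import bisect_right

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(0)  # seeded only so the sketch is repeatable

# Random high-dimensional vectors tend to be nearly orthogonal.
u = [rng.gauss(0, 1) for _ in range(1000)]
v = [rng.gauss(0, 1) for _ in range(1000)]
random_pair_similarity = cosine_similarity(u, v)

# Synthetic query-document similarity scores before and after a drift.
baseline = [rng.gauss(0.92, 0.02) for _ in range(200)]
drifted = [rng.gauss(0.78, 0.02) for _ in range(200)]
drift_gap = ks_statistic(baseline, drifted)
```

Against a background where unrelated vectors sit near zero, a move from roughly 0.92 to roughly 0.78 separates the two score distributions almost completely, so the KS statistic lands close to its maximum of 1.0 even though the raw numbers look similar.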

26. Contextual embeddings and format-driven drift

Jurafsky and Martin explain in SLP3 §10.3 that contextual embeddings from models like BERT produce different vectors for the same word depending on its surrounding context. The word "bank" in "river bank" and "bank account" receives entirely different representations. This is the property that makes contextual embeddings powerful for retrieval, but it also makes drift detection more difficult. When upstream data changes format (dates shifting from ISO 8601 to locale-specific strings), the surrounding context of every date mention changes, and the contextual embedding shifts accordingly. The drift is not in the vocabulary; it is in the context that the embedding model attends to.

Jurafsky & Martin, SLP3 §10.3.

The Post-Mortem Template

27. Matching detection metrics to failure classes

Jurafsky and Martin's discussion of LLM evaluation in SLP3 §7.6 and QA evaluation in §11.6 provides formal tools for the "detection gap" field in this template. For generation quality, they describe perplexity and benchmark accuracy. For retrieval quality, they describe exact match and F1 score (§11.6). The post-mortem template implicitly asks: which of these metrics would have caught the failure? The answer often reveals that no standard metric applies. Prompt drift is not captured by perplexity. Tool cascade failures are not captured by retrieval F1. Silent degradation is not captured by benchmark accuracy. The detection gap question forces teams to invent new metrics for each failure class, which is precisely the discipline that prevents recurrence.

Jurafsky & Martin, SLP3 §7.6, §11.6.

The Common Thread

28. Begging the question and plausible continuation

Farris et al. warn that 'begging the question' exploits are easy: LLMs will confidently explain false premises because they're optimized for plausible continuation, not contradiction detection. Many post-mortem failures trace back to this fundamental asymmetry.

See GH #3, Farris et al., Ch. 4.

29. Intrinsic versus extrinsic hallucination

Jurafsky and Martin define hallucination in SLP3 §7.7 and distinguish two types: intrinsic (contradicting the source) and extrinsic (making claims unsupported by any source). The six failure modes in this article map onto this taxonomy in illuminating ways. Retrieval poisoning produces intrinsic hallucination (the model contradicts the actual case law by reconstructing fake citations from fragments). Prompt drift produces extrinsic hallucination (the model generates claims about product features that exist nowhere in its context). The common thread is that hallucination is not a single failure mode but a family of failures with different root causes, and each requires a different mitigation strategy.

Jurafsky & Martin, SLP3 §7.7.

30. Collapsed architectures and lost component boundaries

Widdows and Cohen reinforce this point with a historical parallel. In Ch. 6.1.1, they describe a traditional architecture where factual accuracy and language fluency were separate responsibilities: a knowledge base stored facts, and a language model turned them into text (Figure 6.2). Modern LLMs collapsed these into one component, which is why failures at component boundaries are so dangerous now. The Galactica example (Ch. 6.1.1), where a model trained on scientific literature generated false medical claims about Ivermectin, illustrates what happens when the boundary between "retrieves knowledge" and "generates plausible text" disappears entirely.

Widdows & Cohen, Ch. 6.1.1. Issue #45