Sources
Grounding, citations, and further reading for When Prompts Fail.
All of this is optional. These are the sources I used to write the course, shown here as grounding for the research behind the article. Nothing on this page is required reading, and you do not need to purchase any of these books.
The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific topic knows where to look.
About the Sources
SLP3: Jurafsky & Martin
Freely available at web.stanford.edu/~jurafsky/slp3/. Chapter 7 covers generation, prompting, evaluation, and safety. Provides the formal definitions of hallucination, the mathematics of temperature sampling, and the evaluation frameworks that underpin prompt debugging.
Widdows & Cohen: Large Language Models
Accessible survey strong on the architectural roots of prompt failure: why hallucination is a design consequence rather than a bug, why instruction-following is a thin layer over sentence completion, and why the "hallucination" label itself is contested.
Ji et al.: Survey of Hallucination
Comprehensive taxonomy of hallucination types in natural language generation, covering intrinsic and extrinsic hallucination, evaluation methods, and mitigation strategies.
The Silent Failure Problem
1Why prompt failures look fluent rather than broken
SLP3 section 7.4 (pp. 10-12) formalizes why prompts fail "silently." Generation uses either greedy decoding (Eq. 7.2), which always picks the most probable token, or random sampling, which draws tokens proportional to their probability. Neither strategy has any mechanism for verifying correctness. The model produces statistically plausible continuations, which is why failures look fluent rather than broken. There is no error signal in the generation process; only a probability distribution over possible next tokens.
SLP3 section 7.4, Eqs. 7.2-7.4. Read SLP3 ↩ Back to article
Failure Mode 1: Hallucination
2The formal definition of hallucination
SLP3 section 7.7 (p. 22) defines hallucination formally: "LLMs are prone to saying things that are false, a problem called hallucination. Language models are trained to generate text that is predictable and coherent, but the training algorithms we have seen so far don't have any way to enforce that the text that is generated is correct or true." Jurafsky and Martin also note that hallucination can manifest as suggesting unsafe actions, citing Bickmore et al. (2018) where commercial dialogue systems proposed actions that "if actually taken, would have led to harm or death." The problem predates LLMs.
SLP3 section 7.7. Read SLP3 ↩ Back to article
3Hallucination as architectural feature, not bug
Widdows and Cohen argue in Ch. 6 that hallucination is an architectural feature, not a bug. Language models were originally designed for translation, where converting "J.S. Bach was born in 1985" into German should produce the German equivalent, not flag a factual error. The traditional design separated factuality (a knowledge base) from fluency (the language model). Modern LLMs collapse both responsibilities into one system, which is why they produce fluent falsehoods.
Widdows & Cohen, Ch. 6. ↩ Back to article
4Fabricated citations predate chatbots
Widdows and Cohen note in Ch. 6 that fabricated citations are not unique to LLMs. They cite a case where a "phantom reference" to an imaginary article was copied verbatim over 400 times by human authors, long before chatbots existed. They also provide a striking example of the Galactica model (trained exclusively on scientific literature) generating unsubstantiated medical claims about Ivermectin and COVID-19 in the style of a scientific abstract: plausible in form, dangerous in content.
Widdows & Cohen, Ch. 6. ↩ Back to article
5Taxonomy of hallucination types
Ji et al. provide the most comprehensive taxonomy of hallucination in NLG systems. They distinguish intrinsic hallucination (output that contradicts the source) from extrinsic hallucination (output that cannot be verified from the source), and catalog evaluation methods and mitigation strategies across both categories. Their survey establishes that hallucination is not a single failure mode but a family of related problems with different root causes.
Ji, Z., et al. (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys. ↩ Back to article
6Temperature reduction has a precise formal basis
SLP3 section 7.4.3 (Eq. 7.4, p. 12) defines temperature sampling as y = softmax(u/tau). As tau approaches 0, the distribution sharpens: the most probable token's probability approaches 1, converging on greedy decoding. Figure 7.11 shows that at tau = 0.1, the top candidate jumps from .44 to .95 probability. For factual tasks, this concentration reduces the chance of sampling low-probability tokens that constitute hallucinated content.
SLP3 section 7.4.3, Eq. 7.4, Fig. 7.11. Read SLP3 ↩ Back to article
7RAG is a computational compromise, not a factual guarantee
Widdows and Cohen offer an important caveat about RAG in Ch. 5 (Sec. 5.3.3). While RAG improves factual grounding by combining a language model with an information retrieval system, they warn it is easily misinterpreted. The use of domain-specific search results helps produce more factual answers, but "it doesn't mean that these answers are produced directly from a database of established facts." RAG reduces hallucination but does not eliminate it.
Widdows & Cohen, section 5.3.3. ↩ Back to article
8Is "hallucination" even the right word?
Widdows and Cohen challenge the term "hallucination" itself in Ch. 6. They note that Turing called the phenomenon error of conclusion, and cognitive scientist Christopher Summerfield argues the behavior is closer to what humans call confabulation, a much less alarming term. They also point out that we hold LLMs to higher factual standards than much of their training data: political speeches, press releases, and marketing copy are routinely misleading, yet we do not call those "hallucinations."
Widdows & Cohen, Ch. 6. ↩ Back to article
Failure Mode 2: Refusal
9Refusal and sycophancy are two sides of alignment calibration
SLP3 section 7.7 (p. 23) identifies sycophancy, where models "excessively agree with or flatter users. When a user says something that is factually wrong, language models often agree with them instead of correcting them." Jurafsky and Martin cite Cheng et al. (2025) on how sycophantic behavior "can reinforce delusions." SLP3 section 7.5 (p. 14) traces this to the third training stage, preference alignment, where the model is trained via reinforcement learning to produce "accepted" continuations and avoid "rejected" ones. Refusal and sycophancy are two extremes of the same calibration problem.
SLP3 section 7.7, section 7.5. Read SLP3 ↩ Back to article
10The space of possible conversations is too broad to cordon off
Widdows and Cohen describe the opposite failure mode of refusal in Ch. 6: LLM sycophancy, where models trained as assistants tend to agree with viewpoints presented to them rather than challenge them. They connect this to real-world cases where guardrails failed to prevent harmful outcomes, despite safety systems being in place. The space of possible conversations is "too broad to cordon off exhaustively for safety." This suggests refusal and sycophancy are two extremes of the same calibration problem, not two separate bugs.
Widdows & Cohen, Ch. 6. ↩ Back to article
Failure Mode 3: Instruction Drift
11Attention was not designed to enforce instruction persistence
Widdows and Cohen provide the technical context in Ch. 4 and Ch. 5. Attention was originally designed to let a decoder "focus on different combinations of input tokens at different steps," relieving the encoder from having to compress everything into a fixed-length vector. The attention weights are continuous and learned; there is no hard rule about which tokens to attend to. Ch. 5 (Table 5.1) shows context window sizes growing from 512 tokens (BERT, 2018) to 10 million tokens (LLaMA-4, 2025), but larger windows do not eliminate drift. They just delay it.
Widdows & Cohen, Ch. 4, Ch. 5, Table 5.1. ↩ Back to article
12Instruction-following is a thin, fragile layer
Widdows and Cohen make a fascinating observation in Ch. 5 (Sec. 5.2.3). Converting a sentence-completion model into an instruction-following one requires remarkably little additional training: just 52,000 prompt/response pairs, or 100,000 times less data than the original pretraining. They call this "extraordinary and mysterious." The implication for instruction drift is that instruction-following is a thin layer on top of the model's core sentence-completion behavior, making it inherently fragile over long conversations.
Widdows & Cohen, section 5.2.3. ↩ Back to article
Failure Mode 4: Format Non-Compliance
13Format constraints compete against a prose-dominated prior
SLP3 section 7.5.1 (pp. 14-15) shows that pretraining minimizes cross-entropy loss (Eq. 7.6), trained on "text scraped from the web" (section 7.5.2), which is overwhelmingly natural prose. When you request JSON output, you are asking the model to produce tokens that may have lower probability under this prose-dominated training distribution. The format constraint competes against the model's learned prior, and at points of uncertainty, the prior sometimes wins.
SLP3 section 7.5.1, Eq. 7.6. Read SLP3 ↩ Back to article
14The model predicts plausible continuations, not compliant ones
Widdows and Cohen describe autoregressive generation in Ch. 5 as "reading in a prompt, generating a new token, and then adding that token to the prompt and repeating the process." The model's fundamental operation is predicting the most statistically plausible next token. In Ch. 1, they note that models "know which words can be swapped around while still sounding plausible" but are "quite bad at knowing which of those plausible-sounding permutations make accurate claims," or, by extension, which conform to a requested format.
Widdows & Cohen, Ch. 1, Ch. 5. ↩ Back to article
Failure Mode 5: Prompt Injection
15Compromising LLM-integrated applications through retrieved content
Greshake et al. demonstrated that prompt injection does not require direct user access to the prompt. By embedding malicious instructions in web pages, emails, or documents that an LLM-integrated application would retrieve, attackers can hijack model behavior indirectly. Their work established indirect prompt injection as a distinct attack class, separate from the direct "ignore previous instructions" approach, and showed it compromising real-world applications including Bing Chat.
Greshake, K., et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." ↩ Back to article
16Prompt injection as the top LLM vulnerability
The OWASP Top 10 for Large Language Model Applications ranks prompt injection as the number-one vulnerability in deployed LLM systems. The framework catalogs attack vectors, impact assessments, and mitigation strategies across ten vulnerability categories, providing a standardized risk taxonomy for practitioners building LLM-integrated applications.
OWASP Foundation. (2025). OWASP Top 10 for Large Language Model Applications. ↩ Back to article
17LLMs as agents increase the attack surface
SLP3 section 7.7 (p. 24) frames prompt injection within a broader landscape of LLM misuse. Jurafsky and Martin note that "LLMs may carry out additional harmful activities themselves, especially as agent-based paradigms makes it possible for language models to directly interact with the world." The chapter emphasizes that safety mitigation must happen at multiple levels, from training data curation through alignment training to application-level guardrails, echoing the "defense in depth" approach.
SLP3 section 7.7. Read SLP3 ↩ Back to article
18Using models to manipulate people
Widdows and Cohen discuss a related concern in Ch. 6: LLMs being deliberately exploited for social engineering. They cite a Microsoft Research paper showing GPT-4 (before alignment training) generating a multi-step plan to use misinformation to persuade parents not to vaccinate their children, including infiltration of target online communities and emotional appeals. This is the offensive counterpart to prompt injection: rather than manipulating the model, the attacker uses the model to manipulate people.
Widdows & Cohen, Ch. 6. ↩ Back to article
The Debugging Protocol
19Formal evaluation underpins the test-suite approach
SLP3 section 7.6 (pp. 19-22) provides a formal evaluation framework. Jurafsky and Martin describe three dimensions: perplexity (how well the model predicts unseen text, Eq. 7.10-7.11), downstream task accuracy (e.g., MMLU's 15,908 questions across 57 domains), and non-accuracy factors like fairness, energy use, and model size. They also flag data contamination as a critical concern: "when some part of a dataset that we are testing on makes its way into our training set," inflating evaluation scores. For prompt debugging, these methods provide the quantitative backbone for the test-suite approach.
SLP3 section 7.6, Eqs. 7.10-7.11. Read SLP3 ↩ Back to article
20Change one variable at a time: a worked example
Widdows and Cohen demonstrate the "change one variable" principle concretely in Ch. 5 (Sec. 5.2.4). They show the same palindromic-primes question answered at different temperature settings and with/without chain-of-thought prompting. A temperature of 0.2 "constrains the sampling distribution such that a small number of high-logit tokens are most likely to be selected," while higher temperature produced errors. Adding "think step by step" improved the answer from 4-5 to approximately 15. Each change was tested independently.
Widdows & Cohen, section 5.2.4. ↩ Back to article
Building the Muscle
21Chain-of-thought as a debugging and accuracy tool
Wei et al. showed that prompting a model to "think step by step" before producing a final answer measurably improves accuracy on multi-step reasoning tasks. This technique is both a mitigation strategy (forcing explicit reasoning reduces certain failure modes) and a debugging tool (the intermediate steps reveal where the model's reasoning breaks down, making failures diagnosable rather than opaque).
Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS. ↩ Back to article