Sources

Grounding, citations, and further reading for The Anatomy of a Prompt.

All of this is optional. These are the sources I used to write the course, shown here as grounding for the research behind the article. Nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific topic knows where to look.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

Freely available at web.stanford.edu/~jurafsky/slp3/. Cited extensively for formal treatments of conditional generation, system prompts, few-shot/zero-shot taxonomy, in-context learning, temperature sampling, and perplexity.

Widdows & Cohen: Large Language Models

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Cited for sentence-to-instruction transformation, vocabulary probability mechanics, few-shot history, conditional probability, hallucination, KV-cache, chain-of-thought empirics, prefix-tuning, and RAG.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Referenced for seven prompt components (persona, instruction, context, format, audience, tone, data) and in-context learning as "an example is worth a thousand words."

Raschka: Build a Large Language Model (From Scratch)

Raschka, Sebastian. Manning, 2024.

Referenced for instruction fine-tuning template structure (Alpaca format).

Farris et al.

Referenced for the caveat that few-shot learning "isn't really learning" since weights stay identical.

The Anatomy of a Prompt

6Sentence-to-instruction transformation

Widdows and Cohen provide important context here. In Section 5.2.3, they describe the transformation from "finishing sentences to following instructions" and find it remarkable how little additional training data is required to convert a next-token predictor into an instruction-follower. Using LLaMA as a case study, they show that just 52,000 prompt-response pairs (40 MB, or 100,000x less data than pretraining) sufficed.

Widdows & Cohen, §5.2.3.

7Conditional generation

Jurafsky & Martin formalize this intuition in SLP3 §7.2 as conditional generation: "almost anything we want to do with language can be modeled as conditional generation of text." The model computes P(w_i|w_<i) at each step, generating tokens conditioned on the prompt and its own prior outputs.

SLP3 §7.2. Read SLP3

8The prompting spectrum

Jurafsky & Martin formally distinguish three points on the prompting spectrum in SLP3 §7.3 (p.7-8): zero-shot prompting, few-shot prompting (also called demonstrations), and the broader category of in-context learning, which they define as "learning that improves model performance or reduces some loss but does not involve gradient-based updates to the model's underlying parameters."

SLP3 §7.3. Read SLP3

9Vocabulary probability mechanics

Widdows and Cohen explain the mechanics in Section 5.2.2. The representation of the last token is compared against output embeddings for every token in the vocabulary via a scalar product, producing logits that are converted to probabilities through a softmax function.

Widdows & Cohen, §5.2.2.

The Three Layers

10Seven prompt components

Alammar & Grootendorst enumerate seven prompt components: persona, instruction, context, format, audience, tone, and data. They stress that iterative refinement is essential.

Alammar & Grootendorst, Ch. 6.

System Prompts

11System prompts as an intervention point

The textbook identifies system prompts as one of four intervention points for constraining LLM behavior. The others are curating training data, altering base model training, and intercepting outputs with code. System prompts are the cheapest and most accessible lever.

Farris et al., Ch. 5.

12Formal definition of system prompts

Jurafsky & Martin define a system prompt as "a single text prompt that is the first instruction to the language model, and which defines the task or role for the LM, and sets overall tone and context." The system prompt is "silently prepended to any user text." They provide the example of Anthropic's Claude system prompt at 1700 words.

SLP3 §7.3. Read SLP3

13The logit mechanism

The mechanism is formalized in SLP3 §7.4 (Eq. 7.1): the model produces a logit vector u of shape [1 x |V|] for each token, then normalizes via softmax to produce y = softmax(u). The system prompt alters the context that produces these logits, reshaping the entire probability distribution.

SLP3 §7.4. Read SLP3

14Conditional probability and Bayes

Widdows and Cohen introduce conditional probability formally in Chapter 1 using a spam-filtering example: P(x|y) denotes the probability of observing word x given category y. The same conditional probability framework underlies how system prompts work.

Widdows & Cohen, Ch. 1.

15Hallucination by design

Widdows and Cohen argue in Chapter 6 that language models are designed to generate text that is plausible rather than factually accurate. The Galactica model generated a convincing but fabricated scientific abstract about Ivermectin treating COVID-19. Since system prompts operate within this probabilistic framework, they can bias but never guarantee truthful output.

Widdows & Cohen, Ch. 6.

Few-Shot Examples

16Not really learning

Farris et al. caution that few-shot learning "isn't really learning." The model weights stay identical. It's prompt engineering, not training.

Farris et al., Ch. 7.

17GPT-3 and the shattered presumption

Widdows and Cohen note that GPT-3's few-shot performance shattered "an established presumption that machine learning models were good for particular specialized tasks, but each task required dedicated training."

Widdows & Cohen, §5.1.

18Formal definition of in-context learning

Jurafsky & Martin define in-context learning as "learning that improves model performance or reduces some loss but does not involve gradient-based updates to the model's underlying parameters." Prompts can be viewed as a learning signal. "The weights of the model are not updated by prompting; what changes is just the context and the activations in the network."

SLP3 §7.3. Read SLP3

19Romeo and context-sensitive pattern completion

Widdows and Cohen illustrate how attention enables context-sensitive pattern completion: the same word "Romeo" produces entirely different next-token predictions depending on whether context is Shakespearean or automotive.

Widdows & Cohen, Ch. 7, §4.3.

20Incorrect demonstrations still help

SLP3 §7.3 provides striking empirical support: "demonstrations that have incorrect answers can still improve a system." The primary benefit "seems more to demonstrate the task and the format of the output rather than demonstrating the right answers."

SLP3 §7.3 (citing Min et al., 2022). Read SLP3

21Diminishing returns from more demonstrations

Jurafsky & Martin note: "The number of demonstrations doesn't need to be large; more examples seem to give diminishing returns, and too many examples seems to cause the model to overfit to the exact examples." They also mention DSPy for automated demonstration selection.

SLP3 §7.3. Read SLP3

22An example is worth a thousand words

Alammar & Grootendorst describe in-context learning as "an example is worth a thousand words," emphasizing that few-shot demonstrations constrain output format and label space without updating model weights.

Alammar & Grootendorst, Ch. 6.

23PagedAttention and computational overhead

Widdows and Cohen discuss PagedAttention, which maintains the KV cache across generation steps, avoiding re-reading the entire prompt for every new token. While few-shot examples still consume context window space, the computational overhead has been significantly reduced.

Widdows & Cohen, §5.3.

Chain-of-Thought

24Chain-of-thought history

Widdows and Cohen note that Google's PaLM (540B parameters, 2022) "popularized chain-of-thought prompting." They frame CoT as "think before you speak," alongside test-time scaling.

Widdows & Cohen, §5.1, §5.2.4.

25Chain-of-thought worked example

Widdows and Cohen prompt LLaMA-3 405B to find prime palindromes below 1000. Without CoT, the model guesses 4-5. With "think step by step," it gets ~15. With LoRA fine-tuning on 1,000 reasoning examples, it reaches the correct 20.

Widdows & Cohen, §5.2.4.

26Temperature and self-consistency

SLP3 §7.4.3: Temperature sampling reshapes the distribution by dividing logits by τ before softmax. τ approaching 0 collapses to greedy; τ=1 is standard; τ>1 flattens. Self-consistency needs moderate temperature for diverse reasoning paths.

SLP3 §7.4.3. Read SLP3

Putting It Together

27The Alpaca template

Raschka shows the Alpaca template structure: ### Instruction:, optional ### Input:, and ### Response:. The template format is a form of prompt anatomy baked into training data.

Raschka, Ch. 7.

28Perplexity and evaluation

SLP3 §7.6 describes perplexity as the standard intrinsic metric for language models. For downstream evaluation, MMLU covers 15,908 questions across 57 domains. Data contamination is flagged as a major concern.

SLP3 §7.6. Read SLP3

The Deeper Lesson

29Prefix-tuning

Widdows and Cohen describe prefix-tuning as "inspired by analogy with natural language prompt-prefixes." Instead of engineering the right words, prefix-tuning trains a small network to produce prefix vectors. This suggests prompting is a natural-language approximation of a deeper computational operation.

Widdows & Cohen, §5.3.4.

30RAG as systematic prompt augmentation

Widdows and Cohen describe RAG with the Romeo example: a Shakespeare-trained model asked "What are good alternatives to the Romeo?" suggests Juliet and Mercutio. RAG injects car magazine text so domain terms steer toward automotive answers. RAG is systematic prompt augmentation.

Widdows & Cohen, §5.3.3.

31The three-stage LLM training pipeline

SLP3 §7.5 contextualizes prompting within the full three-stage LLM training pipeline: pretraining, instruction tuning (SFT), and alignment via preference optimization. Prompting operates after all three stages. Instruction tuning specifically trains models to be "very good at following instructions."

SLP3 §7.5. Read SLP3

Paper Citations

Numbered by order of appearance in article.

1Brown et al. (2020)

Brown, Tom B. et al. "Language Models are Few-Shot Learners." NeurIPS, 2020. arxiv.org/abs/2005.14165

2Wei et al. (2022)

Wei, Jason et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS, 2022. arxiv.org/abs/2201.11903

3Kojima et al. (2022)

Kojima, Takeshi et al. "Large Language Models are Zero-Shot Reasoners." NeurIPS, 2022. arxiv.org/abs/2205.11916

4Wang et al. (2022)

Wang, Xuezhi et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR, 2022. arxiv.org/abs/2203.11171

5Liu et al. (2021)

Liu, Jiachang et al. "What Makes Good In-Context Examples for GPT-3?" DeeLIO Workshop, ACL, 2021. arxiv.org/abs/2101.06804