Sources

Grounding, citations, and further reading for The Prompt Engineer's Pattern Book.

Everything here is optional. These are the sources behind the article; none of them is required reading, and you do not need to purchase any of the books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific pattern knows where to look.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 7 provides the canonical treatment of prompting, decoding strategies, and temperature, and is the main textbook grounding for this article.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Accessible and mathematically grounded survey of LLM architecture and behavior. Particularly strong on how personas bias contextual embeddings, why templates work, and the tradeoffs between sampling temperature and reasoning reliability.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey co-written by the author of The Illustrated Transformer. Strong on applied patterns: sequential prompt chains, the chain-of-thought to self-consistency to tree-of-thought progression, and concrete engineering tradeoffs in deployed systems.

Raschka: Build a Large Language Model (From Scratch)

Raschka, Sebastian. Manning Publications, 2024.

Hands-on walk-through of the mechanics behind instruction-following models. Useful grounding for why template formats like Alpaca and Phi-3 matter, and how template choice affects both training efficiency and output quality.

Introduction

6. A Prompt Pattern Catalog

White et al. propose a design-pattern-style catalog for prompt engineering, explicitly drawing the analogy to Gang-of-Four software patterns. The patterns in this article (persona, template, meta-prompting, self-consistency) align with categories in the catalog, but the article narrows to the four that experienced practitioners reach for most often. The catalog is a useful extended reference for readers who want a broader taxonomy.

White, J., et al. (2023). "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT." arXiv preprint.

7. Prompt engineering as a named activity

Jurafsky and Martin formally define the activity underlying these patterns in SLP3 §7.3 (p.7): "A prompt is a text string that a user issues to a language model to get the model to do something useful. In prompting, the user's prompt string is passed to the language model, which iteratively generates tokens conditioned on the prompt. The process of finding effective prompts for a task is known as prompt engineering." Each pattern in this article is a reusable strategy within that engineering process.

SLP3 §7.3. Read SLP3

Persona Prompting

8. Personas as output-level intervention

Alammar and Grootendorst frame all prompt engineering as part of general-purpose LLM programming, where you constrain behavior through carefully designed inputs rather than changing model weights. Personas are a form of output interception: the fourth of the book's four behavioral intervention points, alongside retrieval, tool use, and fine-tuning. Personas operate entirely at the prompt layer and require no model modification.

Alammar & Grootendorst, Ch. 5. See also GH #3.

9. Conditional generation and prompt format

The formal basis for persona prompting is SLP3 §7.2's concept of conditional generation: "we give the LLM an input piece of text, a prompt, and then have the LLM continue generating text token by token, conditioned on the prompt" (p.5). A persona is a particular kind of conditioning context. J&M show in §7.3 (p.7) how even the structure of the prompt matters: a question framed as "Q: ... A:" versus a plain question shifts the probability distribution differently, because the model has seen these formats in its training data. A persona prompt works the same way, activating text-completion patterns associated with expert discourse.

SLP3 §7.2-7.3. Read SLP3

10. Pretraining corpora and domain activation

SLP3 §7.5.2 (p.17) explains why persona prompting activates domain-specific knowledge. Pretraining corpora like The Pile (825 GB) and Common Crawl include vast amounts of domain-specific text: PubMed Central for medical writing, ArXiv for scientific papers, StackExchange for technical Q&A, and FreeLaw for legal text (see Fig. 7.14). A persona like "licensed CPA" biases the model toward completing text in the statistical patterns learned from tax and accounting documents within these corpora. The persona does not inject new knowledge; it activates a subset of what was already learned during pretraining.

SLP3 §7.5.2, Fig. 7.14. Read SLP3

11. The geometry of contextual embeddings

Widdows and Cohen provide the mechanistic detail for why personas work. In §5.2.2 they show that the final contextual token embedding is compared via scalar product against every token in the vocabulary, then passed through softmax to produce a probability distribution. A persona shifts the contextual embedding into a region of vector space closer to domain-specific output tokens, literally biasing which tokens have high logits. Their Figure 5.5 visualizes this: the phrase "A large language" clusters near model and modeling, while "A foreign language" clusters near learning and community.

Widdows & Cohen, §5.2.2, Fig. 5.5. See also GH #45.
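The scoring step described in this note can be sketched in a few lines. This is a toy illustration, not code from Widdows and Cohen: the two-dimensional embeddings, the two-word vocabulary, and the context vectors standing in for "A large language" and "A foreign language" are all invented, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(context_vec, vocab_embeddings):
    """Score every vocabulary token by its scalar product with the context vector."""
    logits = {
        token: sum(c * e for c, e in zip(context_vec, emb))
        for token, emb in vocab_embeddings.items()
    }
    probs = softmax(list(logits.values()))
    return dict(zip(logits.keys(), probs))

# Hypothetical 2-d embeddings, invented for illustration only.
vocab = {
    "model": [1.0, 0.0],
    "learning": [0.0, 1.0],
}
context_large = [2.0, 0.5]    # stand-in for the context "A large language"
context_foreign = [0.5, 2.0]  # stand-in for the context "A foreign language"

dist_large = next_token_distribution(context_large, vocab)
dist_foreign = next_token_distribution(context_foreign, vocab)
print(dist_large)
print(dist_foreign)
```

Moving the context vector toward a token's embedding raises that token's logit, which is the sense in which a persona "biases which tokens have high logits."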

12. The Galactica confabulation problem

Widdows and Cohen illustrate the hallucination danger vividly in Ch. 6 with the Galactica model, which was trained exclusively on scientific literature. When prompted, it generated entirely fabricated claims about Ivermectin treating COVID-19 in the style of a scientific abstract. The expert persona made the falsehood more persuasive, not less. They write that "plausibility in and of itself can be persuasive and text that appears to come from an authoritative source could lead to misguided and even harmful medical decisions," and they prefer the term confabulation over hallucination.

Widdows & Cohen, Ch. 6. See also GH #45.

13. Expert imitation and the Turing test

Widdows and Cohen offer a striking thought experiment in Ch. 6: an expert version of Turing's imitation game where a human expert, a non-expert human, and a chatbot each try to convince a judge they are, say, a plumber or a travel agent. They argue that the non-expert human would be eliminated first every time, because "the range of tasks at which any state-of-the-art chatbot would clearly beat any human, however well trained or educated, keeps growing." The model's ability to impersonate domain experts is already outpacing non-specialists, which sharpens the warning that confident personas can make hallucinations harder to detect.

Widdows & Cohen, Ch. 6. See also GH #45.

Template Patterns

14. Templates and explicit answer sets

Templates work partly because models are explicitly trained on structured prompts during instruction tuning. SLP3 §7.3 (p.7) shows a concrete example: a sentiment analysis template that specifies "Human:" and "Assistant:" roles, enumerated choices "(P) Positive" and "(N) Negative," and an open parenthesis that strongly biases the model to respond with one of the allowed answers. J&M note that "more explicit prompts that specify the set of possible answers lead to better performance" (§7.3, p.7). The template pattern formalizes this observation into a reusable engineering practice.

SLP3 §7.3. Read SLP3
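A template of the kind described in this note might be built as follows. The wording is an approximation written for illustration, not a quotation from SLP3, and `build_sentiment_prompt` and `parse_choice` are hypothetical helper names. The points it preserves from the note are the role labels, the enumerated choices, and the trailing open parenthesis.

```python
def build_sentiment_prompt(text):
    """Fill a fixed sentiment template; the trailing "(" invites "P" or "N"."""
    return (
        f'Human: Do you think that "{text}" has negative or positive sentiment?\n'
        "Choices:\n"
        "(P) Positive\n"
        "(N) Negative\n"
        "\n"
        "Assistant: I believe the best answer is: ("
    )

def parse_choice(completion):
    """Map the first character of the model's continuation to a label, else None."""
    labels = {"P": "positive", "N": "negative"}
    return labels.get(completion.strip()[:1].upper())

prompt = build_sentiment_prompt("Did not like the service that I was provided")
print(prompt)
print(parse_choice("P) Positive"))
```

Because the answer set is closed, the parser can reject anything outside it instead of guessing, which is a large part of what makes the template reusable.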

15. Sequential chains and decomposed tasks

Alammar and Grootendorst describe sequential chains that break complex tasks into subtasks, where each step's output feeds the next step's input. This mirrors the template composition pattern in the article, where focused single-purpose templates are chained into multi-step pipelines. They emphasize that complexity should be distributed across the chain rather than concentrated in a single prompt, which also makes each step independently testable.

Alammar & Grootendorst, Ch. 7. See also GH #5.
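A sequential chain reduces to a loop in which each step's output becomes the next step's input. The sketch below is a generic illustration rather than the book's code: `run_chain` is a hypothetical helper, the two templates are invented, and `fake_llm` is a stand-in that only echoes its prompt so that the data flow is visible without a model.

```python
def run_chain(steps, initial_input, llm):
    """Run (name, template) steps in order, feeding each output into the next prompt."""
    value = initial_input
    trace = []
    for name, template in steps:
        prompt = template.format(input=value)
        value = llm(prompt)
        trace.append((name, prompt, value))  # keep every step for inspection
    return value, trace

def fake_llm(prompt):
    """Stand-in for a real model call: wraps the prompt so the data flow is visible."""
    return f"<{prompt}>"

steps = [
    ("name", "Suggest a product name for: {input}"),
    ("slogan", "Write a slogan for the product named: {input}"),
]
final, trace = run_chain(steps, "a chess-teaching app", fake_llm)
for name, prompt, output in trace:
    print(name, "|", prompt, "->", output)
```

Keeping the per-step trace is what makes each link independently testable: a single step can be replayed with a fixed input and its output checked in isolation.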

16. Why templates work at the model level

Widdows and Cohen provide useful context on how templates work inside the model. In §5.2.3, they show that a 65B LLaMA model was converted from raw next-token prediction to instruction-following using just 52,000 prompt/response pairs (about 40 MB of data, 100,000 times less than pretraining). The training data itself was structured as templates with Instruction, Input, and Output fields. They note it is "remarkable how little additional training data are required to transform a language generating model into an instruction following one." This explains why template patterns are so effective: the models were literally trained on them. Raschka separately shows how structured formats like Alpaca (### Instruction: / ### Input: / ### Response:) and Phi-3 tokens (<|user|> / <|assistant|>) each bake prompt patterns into the model.

Widdows & Cohen, §5.2.3; Raschka, Ch. 7. See also GH #45 and GH #4.

Meta-Prompting

4. APE: Large language models as prompt engineers

Zhou et al. introduce Automatic Prompt Engineer (APE), a system that treats prompt design as a black-box optimization problem. APE uses an LLM to propose candidate prompts, scores them against a held-out evaluation set, and iterates until convergence. On several benchmarks, APE-generated prompts match or exceed human-written ones, often including phrasings that humans would not have tried. The paper is the primary reference for the meta-prompting loop described in the article.

Zhou, Y., et al. (2022). "Large Language Models Are Human-Level Prompt Engineers." ICLR.
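The propose, score, and iterate loop can be sketched as a simple hill climb. This is a schematic under stated assumptions, not the APE implementation: `optimize_prompt` is a hypothetical name, `toy_propose` stands in for an LLM that suggests rewrites, and `toy_score` stands in for accuracy measured on an evaluation set.

```python
def optimize_prompt(propose, score, seed_prompt, rounds=3, candidates_per_round=4):
    """Keep the best-scoring prompt; stop early when a round brings no improvement."""
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        improved = False
        for candidate in propose(best_prompt, candidates_per_round):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break
    return best_prompt, best_score

def toy_propose(prompt, n):
    """Hypothetical proposer: a real system would ask an LLM for rewrites."""
    additions = ["Answer concisely.", "Think step by step.",
                 "Cite the source.", "Use plain language."]
    return [f"{prompt} {extra}" for extra in additions[:n] if extra not in prompt]

def toy_score(prompt):
    """Hypothetical scorer: a real system would measure accuracy on an evaluation set."""
    wanted = ["step by step", "concisely"]
    return sum(phrase in prompt for phrase in wanted)

best, best_score = optimize_prompt(toy_propose, toy_score, "Solve the problem.")
print(best_score, best)
```

The loop treats the model as a black box: only the candidate text and its score are visible to the optimizer, which is why the same structure works regardless of how candidates are generated.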

17. DSPy and automated demonstration selection

SLP3 §7.3 (p.8) describes a related approach to automated prompt optimization: DSPy (Khattab et al., 2024), which "compil[es] declarative language model calls into self-improving pipelines" by automatically choosing "the set of demonstrations that most increases task performance of the prompt on a dev set." Both APE and DSPy validate the meta-prompting insight that prompt quality can be optimized programmatically, but DSPy focuses specifically on demonstration selection rather than instruction rewriting. The two approaches are complementary.

SLP3 §7.3; Khattab et al., 2024. Read SLP3

18. When prompting alone is not enough

Widdows and Cohen discuss two alternatives for when prompting cannot solve the problem. In §5.3.3, they describe RAG (Retrieval Augmented Generation), which augments prompts with retrieved domain-specific text rather than relying on the model's internal knowledge. In §5.3.4, they describe prefix-tuning, which was explicitly "inspired by analogy with natural language prompt-prefixes, such as Please summarize: <Text>." Both techniques go beyond what meta-prompting can achieve, addressing the model's knowledge gap rather than its instruction-following gap.

Widdows & Cohen, §5.3.3-5.3.4. See also GH #45.

19. Template-shaped training data

Raschka shows that structured instruction formats like the Alpaca template (### Instruction: / ### Input: / ### Response:) are themselves prompt patterns, baked into training data. The Phi-3 alternative uses <|user|> and <|assistant|> tokens. Template choice affects both training efficiency (roughly 17 percent difference across runs) and final output quality, reinforcing the article's point that the failure mode is sometimes the model's training assumptions, not the prompt itself.

Raschka, Ch. 7. See also GH #4.
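The two formats can be compared side by side with a pair of small formatting functions. Only the field markers (### Instruction: / ### Input: / ### Response: and <|user|> / <|assistant|>) come from the note above; the preamble sentence, the whitespace, and the function names are assumptions made for illustration and may differ from the book's code.

```python
def format_alpaca(instruction, input_text=""):
    """Alpaca-style prompt; the Input section is omitted when there is no input."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
    )
    if input_text:
        prompt += f"\n\n### Input:\n{input_text}"
    return prompt + "\n\n### Response:\n"

def format_phi3(instruction, input_text=""):
    """Phi-3-style prompt; instruction and input share a single user turn."""
    user_turn = f"{instruction}\n{input_text}" if input_text else instruction
    return f"<|user|>\n{user_turn}\n<|assistant|>\n"

alpaca_prompt = format_alpaca("Convert the sentence to passive voice.",
                              "The chef cooked the meal.")
phi3_prompt = format_phi3("Convert the sentence to passive voice.",
                          "The chef cooked the meal.")
print(alpaca_prompt)
print(phi3_prompt)
```

The Phi-3 style spends fewer tokens on scaffolding per example, which is one plausible route by which template choice could affect training cost.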

Self-Consistency

3. Self-consistency improves chain of thought

Wang et al. introduce self-consistency as an extension of chain-of-thought prompting. Rather than committing to a single reasoning path, the system samples multiple chains at elevated temperature and takes the majority answer among the final predictions. Gains are most pronounced on arithmetic, commonsense, and symbolic reasoning benchmarks. This paper is the primary source for the technique described in the Self-Consistency section.

Wang, X., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR.
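The voting step at the heart of the technique is small enough to show directly. This is a minimal sketch, not the paper's code: `self_consistency` is a hypothetical name, and the scripted sampler replaces a real model call that would generate a reasoning chain at elevated temperature and extract its final answer.

```python
from collections import Counter

def self_consistency(sample_answer, question, n_samples=5):
    """Sample several final answers and return the majority answer with its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    # most_common breaks ties in favor of the answer that appeared first.
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

def make_scripted_sampler(final_answers):
    """Stand-in for a model: replays a fixed list of final answers, one per call."""
    remaining = iter(final_answers)
    return lambda question: next(remaining)

sampler = make_scripted_sampler(["18", "26", "18", "18", "17"])
answer, agreement = self_consistency(sampler, "hypothetical word problem", n_samples=5)
print(answer, agreement)
```

The vote share doubles as a rough confidence signal: a low share suggests the sampled chains disagree and the answer deserves a closer look.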

2. Chain-of-thought prompting

Wei et al. establish chain-of-thought prompting as a general technique: providing or eliciting intermediate reasoning steps before the final answer. Self-consistency builds directly on this by sampling multiple chains rather than committing to one. Together the two papers define the reasoning-pattern baseline that most modern prompt frameworks refine or extend.

Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS.

20. Greedy decoding vs. random sampling

The contrast between single-sample and multi-sample approaches maps onto two decoding strategies formalized in SLP3 §7.4. Greedy decoding (§7.4.1, Eq. 7.2) always selects the highest-probability token: ŵ_t = argmax_w P(w | w_<t). It is deterministic: every run produces the same output, so majority voting over repeated runs has nothing to choose between. Random sampling (§7.4.2) instead draws tokens from the distribution according to their probabilities. J&M note that "greedy decoding is too boring, and random sampling is too random" (p.12), motivating temperature sampling as the middle ground that makes self-consistency viable.

SLP3 §7.4. Read SLP3
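The difference between the two strategies shows up even on a single decoding step. The sketch below uses an invented three-token distribution and hypothetical function names; it picks one token per call rather than decoding a full sequence.

```python
import random

def greedy_pick(probs):
    """Greedy decoding for one step: always return the highest-probability token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Random sampling for one step: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

probs = {"the": 0.5, "a": 0.3, "this": 0.2}  # hypothetical next-token distribution
rng = random.Random(0)  # seeded so the run is repeatable

greedy_runs = {greedy_pick(probs) for _ in range(100)}
sampled_runs = {sample_pick(probs, rng) for _ in range(100)}
print(greedy_runs)
print(sampled_runs)
```

A hundred greedy runs collapse to a single token, while sampling produces several distinct ones, which is the variety that voting across samples depends on.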

21. Temperature as softmax rescaling

SLP3 §7.4.3 (p.12-13) formalizes the temperature mechanism as y = softmax(u/τ) (Eq. 7.4), where τ is the temperature parameter. J&M provide a concrete numerical example (Fig. 7.11, p.14): with logits [1.2, 0.9, 0.1, -0.5], setting τ = 0.5 shifts probabilities from [.44, .33, .15, .08] to [.59, .32, .07, .02], and τ = 0.1 pushes them to [.95, .05, 0, 0]. The intuition comes from thermodynamics: a system at a high temperature is flexible and can explore many possible states, while a system at a lower temperature tends to explore lower-energy (better) states. The 0.5 to 0.7 range recommended for self-consistency keeps the distribution diverse enough for varied reasoning paths while concentrated enough to avoid incoherence.

SLP3 §7.4.3, Eq. 7.4, Fig. 7.11. Read SLP3
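The numbers quoted in this note can be reproduced directly from y = softmax(u/τ). The sketch below uses the logits given above; the function name is hypothetical, and τ = 1.0 is assumed as the unscaled baseline for the first probability vector.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Compute softmax(u / tau); a smaller tau sharpens the distribution."""
    scaled = [u / tau for u in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 0.9, 0.1, -0.5]
for tau in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 2) for p in probs])
# 1.0 [0.44, 0.33, 0.15, 0.08]
# 0.5 [0.59, 0.32, 0.07, 0.02]
# 0.1 [0.95, 0.05, 0.0, 0.0]
```

Dividing by τ < 1 stretches the gaps between logits before exponentiation, so the leading token absorbs probability mass from the tail; at τ = 0.1 the result is close to greedy decoding.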

22. A concrete temperature-tradeoff example

Widdows and Cohen provide a concrete example of the temperature tradeoff in §5.2.4. When asking LLaMA-3 to count prime palindromes below 1000, a temperature of 0.2 "constrains the sampling distribution such that a small number of high-logit tokens are most likely to be selected" and produced the correct answer. An earlier run with higher temperature "showed the same reasoning strategy, but produced an overestimate because some erroneous palindromes (such as 13) were included in the final count." This illustrates why self-consistency needs a moderate temperature: too low and the samples are nearly identical, leaving nothing to vote on; too high and errors become frequent enough that the ensemble may no longer outvote them.

Widdows & Cohen, §5.2.4. See also GH #45.

23. CoT, self-consistency, and tree-of-thought

Alammar and Grootendorst cover chain-of-thought ("let's think step-by-step"), self-consistency (majority voting across multiple sampled outputs), and tree-of-thought (exploring multiple reasoning branches) as a progression of increasingly robust prompting strategies. Self-consistency is presented as the natural reliability layer on top of CoT, and tree-of-thought as the next step up when a single linear chain is not enough. The article follows this framing when distinguishing tasks where self-consistency applies from tasks where it does not.

Alammar & Grootendorst, Ch. 6. See also GH #5.

24. Test-time scaling and zero-shot CoT

Widdows and Cohen describe an alternative to self-consistency for improving reasoning: test-time scaling, which "devotes extra computational resources at inference time to check and revise answers" (§5.2.4). They also show that zero-shot chain-of-thought prompting ("think step by step") led a model to define key terms and decompose the problem into units before answering, improving accuracy even without multiple samples. Combined with LoRA fine-tuning on just 1,000 reasoning examples, the model learned to break problems down reliably. This suggests a spectrum of reliability strategies beyond majority voting.

Widdows & Cohen, §5.2.4; Kojima et al. (2022). See also GH #45.

5. Large language models as zero-shot reasoners

Kojima et al. show that a simple trigger phrase ("let's think step by step") is enough to elicit multi-step reasoning from sufficiently large models, without any few-shot demonstrations. This result is what allows self-consistency to work in a zero-shot regime: you can sample multiple reasoning chains using only the trigger, then vote on final answers. The paper is a key companion to Wei et al. and Wang et al. in the self-consistency literature.

Kojima, T., et al. (2022). "Large Language Models are Zero-Shot Reasoners." NeurIPS.

Closing

1. Language models are few-shot learners

Brown et al.'s GPT-3 paper is the foundational reference for in-context learning and few-shot prompting. It established that large language models can be steered through examples in the prompt rather than through gradient updates, which is the precondition for every pattern in this article. The patterns described here are refinements of the basic in-context learning capability that paper introduced at scale.

Brown, T., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS.

25. The Bitter Lesson and prompt patterns

Widdows and Cohen frame this in terms of "The Bitter Lesson" (§4.4): progress in AI has most consistently come from making the best use of computation, not from encoding expert knowledge. They note it is "tempting to assume that a chess computer would need to be programmed by people who really understand chess, that image filters should be designed by image experts, and that we need to study rules of language to build NLP systems. Instead, progress in AI has most consistently come from making the best use of ever-increasing computational resources." Prompt patterns sit in an interesting middle ground: they are human-crafted rules of a sort, but they work by nudging statistical processes rather than encoding domain logic.

Widdows & Cohen, §4.4. See also GH #45.

26. Historical perspective on prompting

The Historical Notes section of SLP3 Ch. 7 (p.25-27) provides perspective on how rapidly this field has evolved. The term "language model" was coined by Jelinek at IBM in 1975 for n-gram models. The neural language model emerged with Bengio et al. (2003). GPT-2 (Radford et al., 2019) first demonstrated that autoregressive language models could perform NLP tasks zero-shot. The prompting patterns described in this article emerged in the span of roughly 2020 to 2023, building on decades of language modeling research. J&M use the term "foundation model" (Bommasani et al., 2021) to describe the broader phenomenon of applying LLM technology across domains, suggesting these patterns may generalize beyond text.

SLP3 Ch. 7 Historical Notes. Read SLP3