Sources

Grounding, citations, and further reading for The Academic History of Prompt Engineering.

All of this is optional. These are the primary sources that ground the article's historical claims; none of them is required reading, and you do not need to purchase any of the books or journal articles.

The article itself is self-contained. This page exists so that the work is properly cited and so that students who want to read the original papers know where to look.

About the Sources

Taylor: Cloze Procedure (1953)

Taylor, Wilson L. (1953). "Cloze Procedure: A New Tool for Measuring Readability." Journalism Quarterly, 30(4), 415-433.

The primary source for the claim that prompting predates machine learning. Taylor's Cloze procedure deletes every fifth word from a passage and asks human readers to restore it, producing a readability measure grounded in predictability. The deletion-and-restoration task is structurally identical to the masked-language-modeling objective used seven decades later. Available via SAGE Journals at doi.org/10.1177/107769905303000401.

Shannon: Information Theory (1948, 1951)

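Shannon, Claude E. (1948). "A Mathematical Theory of Communication." The Bell System Technical Journal, 27(3), 379-423.
Shannon, Claude E. (1951). "Prediction and Entropy of Printed English." The Bell System Technical Journal, 30(1), 50-64.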

Two companion papers that establish the predictive-completion framing. The 1948 paper introduces entropy as a measure of information content; the 1951 paper estimates the entropy of printed English by having human subjects guess successive letters, directly anticipating next-token prediction. Both papers are freely available: the 1948 paper at people.math.harvard.edu, the 1951 paper at archive.org.

Brown et al.: GPT-3 (2020)

Brown, Tom B., et al. (2020). "Language Models are Few-Shot Learners." arXiv:2005.14165.

The inflection point for prompt engineering as a discipline. The GPT-3 paper demonstrates that a 175-billion-parameter model can perform new tasks from a handful of natural-language demonstrations, without gradient updates. Few-shot in-context learning made the prompt itself a load-bearing interface, not a convenience. Available on arXiv at arxiv.org/abs/2005.14165.

Wei et al.: Chain-of-Thought Prompting (2022)

Wei, Jason, et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903.

The primary source for the claim that chain-of-thought is emergent at scale. The paper shows that prompting a sufficiently large model with intermediate reasoning steps unlocks performance on arithmetic, commonsense, and symbolic reasoning tasks that smaller models cannot solve regardless of prompting. Available on arXiv at arxiv.org/abs/2201.11903.

Ouyang et al.: InstructGPT (2022)

Ouyang, Long, et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." arXiv:2203.02155.

The canonical reference for instruction tuning and RLHF. The paper describes how supervised fine-tuning on human demonstrations plus reinforcement learning from human preferences produces a model that follows natural-language instructions, which is why modern prompts read as imperatives rather than as completion stubs. Available on arXiv at arxiv.org/abs/2203.02155.

Intended scope

1. Taylor 1953: the Cloze procedure

Taylor's central contribution is the procedure itself: delete every nth word (typically every fifth) from a passage, present the mutilated text to readers, and score their ability to restore the missing tokens. The resulting Cloze score correlates with traditional readability formulas and, critically, treats comprehension as a predictive act. Pages 415-419 of the original Journalism Quarterly article establish the methodology; pages 420-428 validate it against Dale-Chall and Flesch readability measures. The term "cloze" itself is a contraction of Gestalt psychology's "closure," signaling that Taylor saw completion as the fundamental cognitive operation being measured.
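As a rough illustration of the deletion-and-scoring step described above, here is a minimal Python sketch. The function names `make_cloze` and `cloze_score`, the whitespace tokenization, the blank marker, and the case-insensitive exact-match scoring are all assumptions made for this example, not details taken from Taylor's paper.

```python
def make_cloze(text, n=5, blank="_____"):
    """Replace every nth word with a blank; return the masked text and the deleted words."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    masked, answers = [], []
    # Count words from 1 so that n=5 removes the 5th, 10th, 15th word, and so on.
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            answers.append(word)
            masked.append(blank)
        else:
            masked.append(word)
    return " ".join(masked), answers


def cloze_score(answers, guesses):
    """Fraction of blanks restored exactly (case-insensitive); an assumed scoring rule."""
    if not answers:
        return 0.0
    correct = sum(a.lower() == g.lower() for a, g in zip(answers, guesses))
    return correct / len(answers)


passage = "the quick brown fox jumps over the lazy dog again"
masked_text, deleted = make_cloze(passage, n=5)
# masked_text == "the quick brown fox _____ over the lazy dog _____"
# deleted == ["jumps", "again"]
score = cloze_score(deleted, ["jumps", "twice"])  # one of two blanks restored: 0.5
```

The same masked text could be handed to a human reader or to a model; only the source of the guesses changes.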

For the article's purposes, Taylor establishes the conceptual precedent: probing a language system by withholding information and observing what it restores. Structurally, the technique matches masked language modeling; what has changed is the language system being probed.

Taylor, W. L. (1953). "Cloze Procedure: A New Tool for Measuring Readability." Journalism Quarterly, 30(4), 415-433.

2. Shannon 1948: entropy of communication

Shannon's 1948 paper introduces the mathematical foundation for quantifying information. Section 6 (pages 393-396) defines entropy H = -sum(p_i log p_i) as the average uncertainty of a random variable. Earlier in the paper, Section 3 models English letter sequences as Markov processes of increasing order; the progression from zero-order to third-order approximations on pages 388-389 shows how each additional symbol of context reduces uncertainty, which is precisely the mechanism a modern language model exploits.

The relevance to prompt engineering is structural rather than historical: entropy gives us the language to say why prompts work at all. A prompt reduces the conditional entropy of the next token, and the Markov-chain samples in Shannon's paper are among the earliest published examples of conditional text generation.
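To make the entropy argument concrete, the following Python sketch estimates the entropy of single letters in a short string and the conditional entropy of a letter given the one before it, using the identity H(X_n | X_{n-1}) = H(X_{n-1}, X_n) - H(X_{n-1}). The function names and the sample string are invented for illustration, and estimates from such a tiny sample are not meaningful measurements of English.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits, H = -sum(p_i * log2(p_i)), from a Counter of outcomes."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(text):
    """Estimate H(next letter | previous letter) as H(pair) - H(previous letter)."""
    pairs = Counter(zip(text, text[1:]))   # adjacent (previous, next) letter pairs
    previous = Counter(text[:-1])          # marginal over the 'previous' position
    return entropy(pairs) - entropy(previous)


sample = "the theory of the thing then thins the thought"  # hypothetical sample text
h_unconditioned = entropy(Counter(sample[1:]))  # uncertainty with no context
h_conditioned = conditional_entropy(sample)     # uncertainty given one letter of context
print(f"no context: {h_unconditioned:.3f} bits, one letter of context: {h_conditioned:.3f} bits")
```

Under this estimator the conditioned value can never exceed the unconditioned one, which is the one-symbol version of the claim that context reduces uncertainty.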

Shannon, C. E. (1948). "A Mathematical Theory of Communication." The Bell System Technical Journal, 27(3), 379-423.

3. Shannon 1951: predicting printed English

The 1951 paper is more immediately relevant to prompting than the 1948 paper. Shannon estimates the entropy of English by asking human subjects to guess successive letters of a text they have not seen, recording how many guesses each letter requires. The experimental setup on pages 52-55 is functionally identical to autoregressive next-token prediction: condition on prefix, sample candidate, score the candidate against ground truth. Shannon's bounds of roughly 0.6 to 1.3 bits per letter (page 64) remained the standard benchmark for decades.
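The bounds are derived from how often the correct letter was found on the first guess, the second guess, and so on. As a hedged sketch, the Python below computes an upper bound (the entropy of those guess-rank frequencies) and a lower bound (the sum over ranks i of i * (q_i - q_{i+1}) * log2(i)) from a list of guess counts. The function names and the sample data are hypothetical, and the lower bound assumes the frequencies do not increase with rank, as they would not for an ideal predictor.

```python
import math
from collections import Counter


def guess_rank_frequencies(guess_counts):
    """q[i-1] = fraction of letters identified on exactly the i-th guess."""
    counts = Counter(guess_counts)
    total = len(guess_counts)
    return [counts.get(rank, 0) / total for rank in range(1, max(counts) + 1)]


def entropy_bounds(guess_counts):
    """Return (lower, upper) bounds in bits per letter from guessing-game data."""
    q = guess_rank_frequencies(guess_counts)
    # Upper bound: entropy of the guess-rank distribution.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum_i i * (q_i - q_{i+1}) * log2(i), with q beyond the last rank taken as 0.
    padded = q + [0.0]
    lower = sum(i * (padded[i - 1] - padded[i]) * math.log2(i)
                for i in range(1, len(q) + 1))
    return lower, upper


# Invented record: number of guesses a subject needed for each of eight letters.
lower, upper = entropy_bounds([1, 1, 1, 1, 2, 2, 3, 4])
print(f"between {lower:.2f} and {upper:.2f} bits per letter")
```

A subject who always guesses correctly on the first try yields bounds of zero, which matches the intuition that fully predictable text carries no information.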

For the article, this paper is the clearest early example of treating language as a prediction problem that can be measured, not just modeled. The methodological ancestry from Shannon's guessing game to log-probability-based evaluation of language models is direct.

Shannon, C. E. (1951). "Prediction and Entropy of Printed English." The Bell System Technical Journal, 30(1), 50-64.

4. Brown et al. 2020: few-shot in-context learning

Section 2 of the GPT-3 paper defines the three in-context learning regimes (zero-shot, one-shot, few-shot) that became the vocabulary of prompt engineering. Figure 1.2 on page 4 plots accuracy against the number of in-context examples for models of different sizes, establishing that the benefit of few-shot prompting grows with scale. The evaluation suite in Section 3 covers 42 benchmarks, but the load-bearing claim concerns adaptation: no gradient updates are required to adapt the model to a new task, so the prompt becomes the adaptation mechanism.
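The three regimes differ only in how many demonstrations are placed in the prompt before the query. A minimal Python sketch, assuming a plain "Input:/Output:" layout and a hypothetical helper named `build_prompt` (neither is prescribed by the paper), might look like this:

```python
def build_prompt(task_description, demonstrations, query, k):
    """Assemble a prompt with k demonstrations: k=0 zero-shot, k=1 one-shot, k>1 few-shot."""
    if k < 0:
        raise ValueError("k must be non-negative")
    lines = [task_description, ""]
    # Each demonstration is an (input, output) pair shown to the model as context only;
    # no weights are updated, so the prompt is the entire adaptation mechanism.
    for source, target in demonstrations[:k]:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The query is left open-ended so the model completes the final output.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Illustrative demonstrations; the task and examples are placeholders.
demos = [("cheese", "fromage"), ("sea otter", "loutre de mer")]
zero_shot = build_prompt("Translate English to French.", demos, "peppermint", k=0)
few_shot = build_prompt("Translate English to French.", demos, "peppermint", k=2)
print(few_shot)
```

The resulting string would be sent to whatever model is being evaluated; comparing outputs across values of `k` is the basic experiment the regimes describe.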

This is the paper that turned prompting from a convenience into a discipline. The article's framing of Week 3 as "prompt engineering as craft" rests on the GPT-3 result that prompt quality directly moves measurable benchmarks.

Brown, T. B., et al. (2020). "Language Models are Few-Shot Learners." arXiv:2005.14165.

5. Wei et al. 2022: chain-of-thought as emergent behavior

The paper's central experimental result is in Figure 4 (page 5): chain-of-thought prompting provides essentially no benefit for models below roughly 100 billion parameters, then provides a large benefit once models cross that threshold. Table 3 on page 7 reports the GSM8K arithmetic benchmark jumping from 17.9% to 56.9% on PaLM 540B when intermediate reasoning steps are included in the prompt. Section 6 frames the behavior as emergent, not continuous, which is the specific claim the article references.

The methodological contribution matters as much as the accuracy numbers: Wei et al. show that the prompt can elicit reasoning traces that the model is already capable of but does not spontaneously produce. The prompt shapes the output distribution rather than adding new capability.

Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903.

6. Ouyang et al. 2022: instruction tuning and RLHF

Section 3 of the InstructGPT paper describes the three-stage training recipe: supervised fine-tuning on human-written demonstrations (SFT), reward model training on pairwise preferences, and proximal policy optimization against the reward model. Figure 1 on page 3 shows that a 1.3-billion-parameter InstructGPT is preferred to the 175-billion-parameter base GPT-3 by human labelers, which is the result that reset industry expectations about what scale alone can provide.

For the article, the relevance is stylistic: modern prompts read as imperative instructions ("summarize," "translate," "explain step by step") because the models were trained on exactly that distribution. The grammar of contemporary prompting is a direct consequence of the InstructGPT training pipeline.

Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." arXiv:2203.02155.