Sources

Grounding, citations, and further reading for From Prompt to Token: How LLM Inference Actually Works.

All of this is optional. These are the sources I used to write the article, listed here to ground the research behind it. Nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific topic knows where to look.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 8 is the canonical formal treatment of transformer architecture and inference; most notes on this page cite specific equations and sections from that chapter.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Accessible and mathematically grounded survey of LLM architecture and behavior. Particularly strong on contextual embeddings, RoPE, attention visualization, and inference-time optimizations.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey from the author of The Illustrated Transformer. Strong on the applied side: tokenization pipelines, KV-cache benchmarks, and the engineering tradeoffs in deployed systems.

Georgia Tech Transformer Explainer

Interactive visualization. Polo Club of Data Science.

Live tool at poloclub.github.io/transformer-explainer. Type a prompt and watch token embeddings, attention weights, and layer outputs update in real time. Useful as a classroom demo or quick refresher on Q-K-V flow.

The Pipeline at a Glance

1. Causal self-attention and the scaling of context windows

Widdows and Cohen provide useful context on the encoder-vs-decoder distinction in Ch. 5.1. They explain that "decoder-only models like GPT use a form of self-attention where each token attends only to earlier positions in the sequence, which has become called causal self-attention. This is also what's meant by autoregressive generation." They include a helpful table (Table 5.1) showing how context lengths have scaled from BERT's 512 tokens up to LLaMA-4's 10M.

Widdows & Cohen, Ch. 5.1.

Step 1: Tokenization

2. BPE as trainer plus encoder

Jurafsky and Martin present BPE in SLP3 Section 2.4 as having two components: a trainer that iteratively merges the most frequent adjacent byte pairs to build a vocabulary, and an encoder that greedily applies those learned merges to new text. The algorithm begins with individual characters and grows the vocabulary through k merges, producing typically 50,000 to 200,000 tokens. They note a critical practical detail: BPE usually runs on UTF-8 bytes rather than Unicode characters, meaning the starting vocabulary is just 256 byte values. There are never unknown tokens, because any input can be decomposed into bytes.

SLP3 §2.4.1-2.4.3. Read SLP3
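The trainer/encoder split described in this note can be sketched in a few lines of Python. This is a minimal illustration on UTF-8 bytes, not the SLP3 pseudocode or any production tokenizer. The function names (`merge_pair`, `train_bpe`, `encode_bpe`) and the toy corpus are assumptions made for the example, and the encoder simply replays merges in the order they were learned.

```python
from collections import Counter

def merge_pair(seq, pair, new_id):
    """Replace every adjacent occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(text, k):
    """Trainer: learn up to k merges, starting from the 256 byte values."""
    seq = list(text.encode("utf-8"))
    merges = []  # ((left_id, right_id), new_id), in the order learned
    for step in range(k):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break  # fewer than two symbols left, nothing to merge
        pair, _ = counts.most_common(1)[0]
        new_id = 256 + step
        seq = merge_pair(seq, pair, new_id)
        merges.append((pair, new_id))
    return merges

def encode_bpe(text, merges):
    """Encoder: replay the learned merges, in order, on new text."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq

# Toy corpus (hypothetical); real vocabularies use tens of thousands of merges.
merges = train_bpe("low lower lowest low low", k=10)
ids = encode_bpe("lowly", merges)
unseen = encode_bpe("héllo ✓", merges)  # never fails: falls back to raw bytes
```

Because the starting alphabet is the 256 byte values, `encode_bpe` returns valid IDs for any string, including text that never appeared in the training corpus.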

3. From handcrafted rules to language-agnostic tokenization

Widdows and Cohen trace the evolution of tokenization from handcrafted language-specific rules to automated approaches like Byte-Pair Encoding in Ch. 2 and Ch. 4. They note that BPE's rule is simply "replace the most frequent byte-pair with a single byte, and keep going," and that this is part of the SentencePiece tokenizer used in LLaMA models. They emphasize that SentencePiece treats whitespace as an ordinary symbol, meaning the tokenizer is truly language-agnostic.

Widdows & Cohen, Ch. 2, Ch. 4.

4. The vocabulary table and the subword fallback

Alammar and Grootendorst walk through how the tokenizer prepares inputs by mapping each word or subword to an integer ID in the model's vocabulary table. They emphasize that the vocabulary is fixed at training time, so any word not in the table gets split into smaller subword pieces. This is the first and most overlooked bottleneck in the inference pipeline.

Alammar & Grootendorst, Ch. 2.

Step 2: Embedding Lookup

5. Embedding lookup as one-hot matrix multiplication

Jurafsky and Martin formalize this table lookup in SLP3 Section 8.4. They represent each token as a one-hot vector of shape [1 x |V|] where only one element is 1. Multiplying this by the embedding matrix E of shape [|V| x d] selects the corresponding row (Figures 8.12-8.13). For a full input sequence of N tokens, the one-hot matrix [N x |V|] multiplied by E produces the input matrix X of shape [N x d]. This is why the article says "not a computation, just retrieval," though technically it is a matrix multiplication where the one-hot structure guarantees a simple row selection.

SLP3 §8.4, Figures 8.12-8.13. Read SLP3
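The equivalence between the one-hot multiplication and plain row retrieval can be checked directly. The sketch below uses a made-up embedding matrix with |V| = 4 and d = 3; the helper names `one_hot` and `matmul` are assumptions for illustration, written in pure Python so the example has no dependencies.

```python
def one_hot(token_id, vocab_size):
    """Row vector of shape [1 x |V|] with a single 1 at `token_id`."""
    return [1.0 if j == token_id else 0.0 for j in range(vocab_size)]

def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Toy embedding matrix E of shape [|V| x d] (values are invented).
E = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]
token_ids = [2, 0, 3]  # N = 3 tokens

one_hot_matrix = [one_hot(t, len(E)) for t in token_ids]  # [N x |V|]
X_matmul = matmul(one_hot_matrix, E)                      # [N x d]
X_lookup = [E[t] for t in token_ids]                      # direct row indexing

assert X_matmul == X_lookup
```

Real implementations skip the multiplication and index the rows directly, which is why the lookup costs almost nothing compared with the transformer layers that follow.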

6. The static-embedding problem: the many Romeos

Widdows and Cohen illustrate the static-embedding problem vividly in Ch. 3 using the word "romeo." A single global embedding conflates Shakespeare's Romeo with the Alfa Romeo sportscar. They show that words with similar embeddings can be sequential co-occurrences (star:trek), morphological variants (star:stars), or categorically related (mercutio:tybalt). This directly supports the point that context-free embeddings are just an average over all usages. They also note that the SGNS model learns separate input and output weight vectors, and that typically only the input weights are retained for downstream use.

Widdows & Cohen, Ch. 3.

Step 3: Positional Encoding

7. Why sinusoidal encodings support relative distance

Jurafsky and Martin explain why sinusoidal encodings support relative distance computation in SLP3 Section 8.4. The alternation between sine and cosine functions at each dimension means that for any fixed offset k, the encoding PE(pos + k) can be expressed as a linear transformation of PE(pos). They also note a practical limitation of learned absolute position embeddings: positions near the end of the training length have fewer training examples, so they generalize poorly. Sinusoidal encodings sidestep this because the function is deterministic and applies to any length.

SLP3 §8.4. Read SLP3
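The linear-transformation claim in this note follows from the angle-addition identities, and can be verified numerically. The sketch below assumes the standard formulation from the original transformer paper (sine on even dimensions, cosine on odd ones, base 10000, even `d_model`); the function names are hypothetical.

```python
import math

def positional_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding: [sin(w_0*pos), cos(w_0*pos), sin(w_1*pos), ...]."""
    pe = []
    for i in range(0, d_model, 2):
        omega = 1.0 / (base ** (i / d_model))
        pe.extend([math.sin(omega * pos), math.cos(omega * pos)])
    return pe

def shift(pe, k, d_model, base=10000.0):
    """Apply the offset-k linear map: one 2x2 rotation per (sin, cos) pair.
    The rotation depends only on k, never on the position itself."""
    out = []
    for i in range(0, d_model, 2):
        omega = 1.0 / (base ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        ck, sk = math.cos(omega * k), math.sin(omega * k)
        out.extend([s * ck + c * sk, -s * sk + c * ck])
    return out

d_model, pos, k = 8, 7, 5
direct = positional_encoding(pos + k, d_model)
shifted = shift(positional_encoding(pos, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, shifted))
```

`max_error` comes out at floating-point noise level, so PE(pos + k) is recovered from PE(pos) by a fixed matrix that a linear layer could in principle learn.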

8. Fourier-series interpretation of the PE formula

Widdows and Cohen reproduce the same PE formula in Section 4.3.2 and offer a Fourier-series interpretation: the alternation between sine and cosine functions supports both odd (directed) and even (symmetric) contributions, so that for any fixed offset k, the positional encoding PE(pos + k) can be approximated as a linear transformation of PE(pos). They emphasize that this design makes it possible to model positional relationships "over various sequence ranges, using just linear transformations, which can be broken down into many different scalar product operations and computed in parallel."

Widdows & Cohen, §4.3.2.

9. RoPE as complex-plane rotation

Widdows and Cohen devote an entire section (5.3.6) to RoPE, calling it "a brilliant example of applying complex numbers to language processing." They show how RoPE multiplies embedding coordinate pairs by rotation matrices with angle mθ, where m is the token position. The comparison of two positions m and n then reduces to the complex exponential e^{i(m-n)θ}, encoding relative distance as a phase difference. They draw an explicit parallel to wave physics: "moving along token positions using complex rotations is modeled very like periodic wave oscillation along a time axis."

Widdows & Cohen, §5.3.6.
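The phase-difference property described in this note is easy to check with Python's built-in complex numbers. This is a sketch for a single coordinate pair with one hypothetical angle θ = 0.3; real models apply a different θ to each pair of dimensions, and the names `rope` and `score` are assumptions for the example.

```python
import cmath

def rope(pair, m, theta):
    """Rotate one coordinate pair, stored as a complex number, by angle m*theta."""
    return pair * cmath.exp(1j * m * theta)

def score(q, k):
    """Real inner product of two 2-D vectors stored as complex numbers."""
    return (q * k.conjugate()).real

theta = 0.3                       # hypothetical rotation frequency
q = complex(1.0, 2.0)             # one query coordinate pair (invented values)
k = complex(0.5, -1.5)            # one key coordinate pair (invented values)

near_start = score(rope(q, 10, theta), rope(k, 7, theta))     # m = 10,  n = 7
far_along = score(rope(q, 103, theta), rope(k, 100, theta))   # m = 103, n = 100
```

Both scores agree up to floating-point error because only the offset m - n = 3 survives in the product; the absolute positions cancel.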

10. Why still teach sinusoidal encodings

The original sinusoidal positional encodings are now mostly a historical footnote. Most current open models use RoPE, and some use ALiBi instead. But understanding sinusoidal encodings helps students see why positional information needs to encode relative distance, not just absolute position.

Author's note.

Step 4: The Transformer Layers

11. Pedagogical build-up to Q/K/V projections

Jurafsky and Martin build up to the attention formula pedagogically in SLP3 Section 8.1. They start with a simplified version (Eq. 8.6): attention output a_i is just the weighted sum of prior representations, where weights come from dot-product similarity. Then they introduce Q, K, V projections (Eqs. 8.9-8.14) to allow the model to learn different representations for "what am I looking for" versus "what do I contain" versus "what will I contribute." The crucial insight: without the Q/K/V separation, every token's role as query, key, and value would be conflated in a single vector. The three projections give the model the freedom to attend based on one facet while contributing different information.

SLP3 §8.1, Eqs. 8.6-8.14. Read SLP3
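The two stages of that build-up can be placed side by side in code. The sketch below is a single-head illustration in pure Python, not a transcription of the SLP3 equations: `simple_attention` weights prior vectors by raw dot-product similarity, while `qkv_attention` first projects each vector into separate query, key, and value roles and scales the scores by the square root of the key dimension. The weight matrices are hypothetical inputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, W):
    """Row vector x of length d times matrix W of shape [d x d_out]."""
    return [dot(x, col) for col in zip(*W)]

def simple_attention(X, i):
    """Simplified version: one vector per token plays all three roles."""
    weights = softmax([dot(X[i], X[j]) for j in range(i + 1)])
    return [sum(w * X[j][c] for j, w in enumerate(weights))
            for c in range(len(X[0]))]

def qkv_attention(X, i, W_q, W_k, W_v):
    """Separate projections for 'looking for', 'contain', and 'contribute'."""
    q = project(X[i], W_q)
    keys = [project(X[j], W_k) for j in range(i + 1)]
    values = [project(X[j], W_v) for j in range(i + 1)]
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    return [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
W_q = [[0.5, 0.1], [0.2, 0.9]]             # invented weights
W_k = [[0.3, 0.7], [0.8, 0.2]]
W_v = [[1.0, 0.0], [0.0, 1.0]]

a_simple = simple_attention(X, 2)
a_qkv = qkv_attention(X, 2, W_q, W_k, W_v)
```

Both functions only look at positions j ≤ i, matching the causal setting of the notes above.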

12. Interactive visualization of Q-K-V flow

Georgia Tech's Transformer Explainer is an excellent interactive visualization of self-attention and the full transformer pipeline. It lets you type a prompt and watch token embeddings, attention weights, and layer outputs update in real time. Useful as a teaching reference or a quick refresher on how Q-K-V flows through the architecture.

poloclub.github.io/transformer-explainer

13. Worked attention examples from 405B LLaMA-3

Widdows and Cohen present the scaled dot-product attention formula (Eq. 4.2) in Section 4.3.2, with excellent worked examples. They show attention weight visualizations from a 128-head layer of the 405B LLaMA-3 model, demonstrating how different heads capture different relationships. One head correctly routes "Charles" to "alto" and "1938" to predict "McPherson." Another routes "Romeo" to "Juliet" vs. "Alfa" to predict either "Forrest" or "BMW." These concrete examples powerfully illustrate the abstract description of heads specializing in different relationship types.

Widdows & Cohen, §4.3.2.

14. Formal causal mask as additive negative infinity

Jurafsky and Martin formalize the causal mask in SLP3 Section 8.3 (Eqs. 8.33-8.34, Figure 8.10). They show the N x N QK^T matrix with the upper triangle set to negative infinity, and write the mask matrix M as: M_ij = -infinity for all j > i, and M_ij = 0 otherwise. After softmax, the negative-infinity entries become exactly zero, eliminating all attention to future positions. They note this is done in practice by adding M to the scaled QK^T matrix before softmax, which is computationally simpler than conditional masking.

SLP3 §8.3, Eqs. 8.33-8.34, Figure 8.10. Read SLP3
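The additive mask defined in this note can be sketched as follows. The score matrix is an invented 3 x 3 stand-in for the scaled QK^T values, and the helper names are assumptions for the example.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = -infinity where j > i, and 0 otherwise."""
    return [[NEG_INF if j > i else 0.0 for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)  # always finite: the diagonal entry is never masked
    exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Invented stand-in for the scaled QK^T scores of a 3-token sequence.
scores = [
    [0.5, 1.2, -0.3],
    [0.1, 0.4, 2.0],
    [1.0, 0.0, 0.7],
]

M = causal_mask(3)
masked = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, M)]
weights = [softmax(row) for row in masked]
```

The first row of `weights` puts all of its mass on position 0, and every entry above the diagonal is exactly zero, so no token receives information from a later position.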

15. Masking as a choice, not an architectural constraint

Widdows and Cohen describe the masking choice carefully in their discussion of the original transformer architecture (Section 4.3.2): "The optional Mask step can be used to set all the token vectors after a given position to zero, which simulates language generation, because it makes the future appear empty." They note that masking is not used during the input encoding stage of translation models (where bidirectional context helps), but is used in the output decoding stage. This supplements the article's binary "encoder vs. decoder" framing by showing that the original transformer used both modes in the same model.

Widdows & Cohen, §4.3.2.

16. Why the FFN sits between attention layers

Widdows and Cohen describe the transformer architecture (Figure 4.15, reproducing the original Vaswani et al. diagram) as "several attention layers, interleaved with feed forward layers," with "other normalization and smoothing steps in between, that avoid known problems like overfitting and vanishing gradients." Their treatment of feed-forward layers goes back to basics in Ch. 3, tracing the concept from simple layered networks through to their role in transformers. Useful additional context for readers wanting to understand why the FFN sits between attention layers.

Widdows & Cohen, Ch. 3, §4.3.2.

Step 5: The Final Projection

17. Weight tying and the dual role of the embedding matrix

Jurafsky and Martin formalize weight tying in SLP3 Section 8.5 (Eqs. 8.46-8.47). The logit vector is computed as u = h_N^L E^T, where h_N^L is the output of the final transformer layer at position N and E^T is the transpose of the embedding matrix. They call this transposed matrix the unembedding layer. Weight tying means the same matrix E must be good at two tasks simultaneously: mapping tokens to useful initial representations (embedding) and mapping final representations back to token predictions (unembedding). During training, gradient descent optimizes E for both roles jointly.

SLP3 §8.5, Eqs. 8.46-8.47. Read SLP3
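The dual role of E can be illustrated with one shared matrix used in both directions. This is a toy sketch with an invented 4 x 2 matrix; `embed` and `unembed` are hypothetical helper names, and the hidden state `h` stands in for the output of the final transformer layer.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# One shared matrix E of shape [|V| x d] (invented values).
E = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.6, 0.6],
]

def embed(token_id):
    """Embedding direction: token ID to row of E."""
    return E[token_id]

def unembed(h):
    """Unembedding direction: u = h E^T, one logit per vocabulary entry."""
    return [dot(h, row) for row in E]

h = [0.9, 0.1]          # stand-in for the final-layer output at position N
logits = unembed(h)     # length |V|
best = max(range(len(logits)), key=lambda v: logits[v])
```

Each logit is the dot product between the final hidden state and one token's embedding, so the highest-scoring token is the one whose embedding row points most strongly in the direction of `h`.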

From Logits to Probabilities: Softmax

18. Softmax: the same function, used twice

Widdows and Cohen discuss the softmax function in Ch. 3 (Section 3.2.3) as a general activation-to-probability mapping, and again in Ch. 4 when presenting the scaled dot-product attention equation. In Section 4.3.2, they write that "the softmax ensures that the attention weights sum to 1 across all positions for a given query, effectively forming a probability distribution over the input sequence, and making sure that the system pays attention to the strongest associations, attention is rationed." The same softmax that converts logits to probabilities for sampling also produces the attention weights inside the model.

Widdows & Cohen, §3.2.3, §4.3.2.

Step 7: Sampling Strategies

19. Temperature in a real reasoning task

Widdows and Cohen provide a concrete example of temperature in action in Ch. 5. When testing a reasoning task (counting palindromic primes), they note it "was arrived at with a temperature setting of 0.2, which constrains the sampling distribution such that a small number of high-logit tokens are most likely to be selected. An earlier run with higher temperature showed the same reasoning strategy, but produced an overestimate." This real-world example illustrates the point that temperature does not change the model's knowledge, only its selection behavior.

Widdows & Cohen, Ch. 5.
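The effect of a low temperature such as 0.2 can be seen by dividing the logits by the temperature before the softmax. The logits below are invented, and `softmax_with_temperature` is a hypothetical helper name; the sketch shows the standard formulation rather than any specific provider's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.0, 0.5, -1.0]                 # invented logits for four tokens
cold = softmax_with_temperature(logits, 0.2)   # sharply peaked
warm = softmax_with_temperature(logits, 1.0)   # unmodified distribution
hot = softmax_with_temperature(logits, 2.0)    # flatter
```

The ranking of tokens is identical in all three distributions; only the share of probability given to the top token changes, which is the sense in which temperature alters selection behavior rather than knowledge.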

20. Formal definition of top-p (nucleus) sampling

Jurafsky and Martin present top-p sampling formally in SLP3 Section 8.6.2 (Eq. 8.48). They define the top-p vocabulary V^(p) as the smallest set of words such that the sum of P(w|w_<t) over that set is at least p. They frame the motivation clearly: with top-k, the number of candidates is fixed regardless of the distribution's shape, which is suboptimal. When the distribution is peaked, top-k wastes slots on improbable tokens. When it is flat, top-k may cut off viable options. Top-p solves this by measuring probability mass rather than count, dynamically increasing and decreasing the pool of word candidates.

SLP3 §8.6.2, Eq. 8.48. Read SLP3
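The definition of V^(p) translates almost directly into code: sort by probability, accumulate mass, and stop at the first point where the total reaches p. The sketch below is a minimal version with invented distributions; `top_p_filter` is a hypothetical name, and the renormalization step is the usual way to turn the kept set back into a distribution.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest set with mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

peaked = [0.90, 0.05, 0.03, 0.02]   # invented: one dominant token
flat = [0.25, 0.25, 0.25, 0.25]     # invented: no clear winner

nucleus_peaked = top_p_filter(peaked, 0.85)
nucleus_flat = top_p_filter(flat, 0.85)
```

With the same p, the peaked distribution keeps a single candidate while the flat one keeps all four, which is the adaptive pool size that a fixed top-k cannot provide.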

21. What production APIs actually expose

The Anthropic API exposes temperature, top_p, and top_k. OpenAI's API exposes temperature and top_p but not top_k. Google's Gemini API exposes all three. The industry has largely converged on temperature plus top-p as the standard control surface.

Author's note.

The KV-Cache

22. Concrete benchmark: 4.5s versus 21.8s

Alammar and Grootendorst benchmark the same generation task at 4.5 seconds with KV caching versus 21.8 seconds without it. They explain that because autoregressive generation only adds one new token per step, previously computed key and value matrices do not change and can be reused. This makes the cache one of the most important optimizations in production inference.

Alammar & Grootendorst, Ch. 3.
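The reuse argument in this note can be sketched for a single attention head. This is not the book's benchmark code: the weights and inputs are invented, `KVCache` is a hypothetical class, and the example only demonstrates that projecting the new token alone gives the same output as recomputing every key and value from scratch.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, W):
    return [dot(x, col) for col in zip(*W)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    return [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]

# Invented single-head weights.
W_q = [[0.5, 0.1], [0.2, 0.9]]
W_k = [[0.3, 0.7], [0.8, 0.2]]
W_v = [[1.0, 0.5], [0.0, 1.0]]

def last_output_no_cache(X):
    """Recompute keys and values for the whole prefix on every call."""
    keys = [project(x, W_k) for x in X]
    values = [project(x, W_v) for x in X]
    return attend(project(X[-1], W_q), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token; reuse everything already cached."""
        self.keys.append(project(x, W_k))
        self.values.append(project(x, W_v))
        return attend(project(x, W_q), self.keys, self.values)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
cached_outputs = [cache.step(x) for x in X]
uncached_outputs = [last_output_no_cache(X[: t + 1]) for t in range(len(X))]
```

Without the cache, step t performs t key and value projections; with it, each step performs one, at the cost of memory that grows with the sequence length.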

23. PagedAttention: non-contiguous KV memory

Widdows and Cohen discuss the KV cache memory problem and its solution in Section 5.3.5. They describe PagedAttention (Kwon et al.) as a key optimization: it allows the KV cache to be non-contiguous in memory, meaning "multiple streams of text can be generated simultaneously without the need to reserve space for them as they grow, and also that a single long instruction can be processed once and then used many times with different inputs." This addresses the exact memory pressure described in the article, and goes a step further by explaining how real serving systems manage it.

Widdows & Cohen, §5.3.5.

What This Means for Practitioners

24. Quality versus diversity and why chain-of-thought works

Jurafsky and Martin frame this same point in SLP3 Section 8.6 through the lens of the quality-diversity tradeoff. Methods that emphasize the most probable words produce text that is more accurate, more coherent, and more factual, but also more boring and more repetitive. Chain-of-thought works because it forces the model through a sequence of high-confidence intermediate steps, each of which narrows the distribution for the next step. The model is not reasoning; it is generating a sequence of tokens where each one makes the next correct token more probable. The reasoning is an emergent property of the token sequence, not an internal planning process.

SLP3 §8.6. Read SLP3