Sources

Grounding, citations, and further reading for Inside the Decoder-Only Transformer.

All of this is optional. These are the sources I used to write the course, shown here as grounding for the research behind the article. Nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper on a specific topic knows where to look.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 8 is the canonical formal treatment of transformer architecture; most notes on this page cite specific equations and sections from that chapter.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Accessible and mathematically grounded survey of LLM architecture and behavior. Particularly strong on contextual embeddings, RoPE, and inference-time optimizations.

Raschka: Build a Large Language Model (From Scratch)

Raschka, Sebastian. Manning Publications, 2024.

Implementation-first walkthrough that builds a GPT-style model in PyTorch step by step. Most useful when you want to see the actual tensor shapes and code behind a concept.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey from the author of The Illustrated Transformer. Strong on the applied side: retrieval, fine-tuning, prompting, and the encoder-vs-decoder distinction in deployed systems.

Farris, Biderman & Raff: How Large Language Models Work

Farris, Drew, Stella Biderman & Edward Raff. Manning Publications, 2025.

Concept-first introduction aimed at engineers and data scientists who need a working mental model without the full math. Strong on the "fuzzy dictionary" framing of attention and on the autoregressive generation loop.

Georgia Tech Transformer Explainer

Interactive visualization. Polo Club of Data Science.

Live tool at poloclub.github.io/transformer-explainer. Type a prompt and watch token embeddings, attention weights, and layer outputs update in real time. Useful as a classroom demo or quick refresher on Q-K-V flow.

The High-Level Flow

1. The residual stream framing

Jurafsky and Martin formalize this pipeline in SLP3 Section 8.2 using the concept of the residual stream (Elhage et al., 2021). They describe each transformer block as reading from and writing back to a single stream of d-dimensional representations. The attention layer is the only component that mixes information across token positions; they call it the token-mixing component. The FFN operates on each position independently. This framing clarifies why the diagram shows residual connections as the backbone: the stream carries the original embedding forward, and each sublayer adds its refinement.

SLP3 §8.2, Eqs. 8.26-8.31. Read SLP3 ↩ Back to article

Tokenization

2. The BPE algorithm in two phases

Jurafsky and Martin formalize the BPE algorithm in SLP3 Section 2.4 as having two distinct phases: a trainer that iteratively merges the most frequent adjacent token pairs to build a vocabulary, and an encoder that applies those learned merges greedily to new text. They note that real BPE runs tens of thousands of merges to produce vocabulary sizes of 50,000 to 200,000 tokens, and that in multilingual systems the learned tokens are dominated by English text, so other languages are oversegmented into shorter pieces. This oversegmentation is the mechanism behind the multilingual tax.

SLP3 §2.4.1-2.4.3. Read SLP3 ↩ Back to article
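
The two phases described in this note can be sketched in a few lines of plain Python. This is a toy illustration, not the SLP3 algorithm verbatim: it works on a list of already-split words, omits end-of-word markers and byte-level handling, and the function names train_bpe, apply_merge, and encode are made up for this example.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Trainer: repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # symbol sequence -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, count in corpus.items():
            new_corpus[tuple(apply_merge(list(symbols), best))] += count
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Encoder: apply the learned merges, in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

merges = train_bpe(["abab", "abab", "abc"], num_merges=2)
tokens = encode("ababc", merges)
```

Note that the encoder never looks at frequencies: it only replays the merge list, which is why tokenization is deterministic once the vocabulary has been trained.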

3. From handcrafted rules to end-to-end tokenization

Widdows and Cohen provide useful context on the evolution of tokenization. In Ch. 2, they trace the shift from handcrafted, language-specific tokenization rules (e.g., whitespace delimiters, gerund endings) to automated approaches like Byte-Pair Encoding. They note that SentencePiece, used by LLaMA, treats whitespace as an ordinary symbol, enabling purely end-to-end systems that do not depend on any language-specific processing. This reinforces the point that tokenization is deterministic preprocessing, not a neural operation.

Widdows & Cohen, Ch. 2. ↩ Back to article

Token Embeddings

4. Embedding lookup as matrix multiplication

Jurafsky and Martin make the "no computation, just retrieval" point precise in SLP3 Section 8.4. They show that a token can be represented as a one-hot vector of shape [1 x |V|], and that multiplying this by the embedding matrix E of shape [|V| x d] simply selects the corresponding row. The entire input sequence becomes a matrix of one-hot vectors [N x |V|] multiplied by E to produce the input matrix X of shape [N x d]. It is literally a table lookup expressed as matrix multiplication.

SLP3 §8.4, Figures 8.12-8.13. Read SLP3 ↩ Back to article
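
The equivalence between the one-hot product and a row lookup is easy to verify directly. The sketch below uses a tiny made-up embedding matrix (|V| = 5, d = 3) and a hand-written matmul helper so that it needs no libraries; the values in E are arbitrary placeholders.

```python
def matmul(A, B):
    """Plain-Python matrix product of A [n x m] and B [m x p]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

V, d = 5, 3
# Row i of E is the embedding of token i (placeholder values).
E = [[float(10 * i + j) for j in range(d)] for i in range(V)]

token_ids = [3, 0, 3]  # an input sequence of N = 3 tokens
one_hot = [[1.0 if j == t else 0.0 for j in range(V)] for t in token_ids]

X_matmul = matmul(one_hot, E)         # [N x |V|] times [|V| x d] -> [N x d]
X_lookup = [E[t] for t in token_ids]  # direct row selection
```

Both paths produce the same [N x d] matrix, which is why implementations can skip the multiplication and index into E directly.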

5. Static vs contextual embeddings

Widdows and Cohen discuss this static-vs-contextual distinction at length. In Ch. 4, they explain that ELMo provided the first contextual embeddings, motivated by the fact that the same word may have different meanings, which makes word sense disambiguation a prerequisite for accurate interpretation. Their BERT analysis in Ch. 5 (Section 5.2.2, Figure 5.3) shows 2D projections of the token "card" in different contexts, demonstrating that contextual transformer representations cluster by meaning (credit card vs. red card vs. wild card). This directly illustrates the centroid problem.

Widdows & Cohen, Ch. 4-5. ↩ Back to article

Positional Encoding

6. Three approaches: absolute, sinusoidal, relative

Jurafsky and Martin present three approaches to positional encoding in SLP3 Section 8.4. Absolute position uses learned embeddings for each position (a matrix E_pos of shape [N x d]). Sinusoidal encoding uses fixed sine/cosine functions so that the relationship between any two positions can be expressed as a linear transformation, enabling the model to learn relative distances without being explicitly trained on them. Relative position methods like RoPE go further, encoding position directly in the attention computation at each layer rather than adding it once at the input. The progression from absolute to relative is a design choice about generalization: learned absolute embeddings are the simplest, sinusoidal encodings are defined for lengths not seen in training, and relative methods are designed to generalize across positions most directly.

SLP3 §8.4. Read SLP3 ↩ Back to article

7. RoPE: the full mathematical derivation

Widdows and Cohen devote an entire section (Ch. 5, Section 5.3.6) to RoPE, providing the full mathematical derivation using complex numbers and Euler's formula. They explain that instead of adding a positional component to embedding vectors (as in Vaswani et al.), RoPE puts a positional multiplier operator into the embedding similarity computations, comparing position differences as complex exponents. They draw a parallel to wave oscillation in mathematical physics, noting that moving along token positions using complex rotations is modeled like periodic wave oscillation along a time axis. This section provides additional context for the article's brief mention of "rotation in the complex plane."

Widdows & Cohen, Ch. 5 §5.3.6. ↩ Back to article
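
The relative-position property at the heart of RoPE can be checked numerically. The sketch below is a simplification under stated assumptions: it treats one 2-dimensional query and key as single complex numbers and uses an arbitrary rotation frequency theta = 0.1, whereas real implementations rotate many pairs of dimensions at different frequencies. The names rope_rotate and score are hypothetical.

```python
import cmath

def rope_rotate(vec, pos, theta=0.1):
    """Rotate a complex number by an angle proportional to its position."""
    return vec * cmath.exp(1j * pos * theta)

def score(q, k, m, n):
    """Similarity of a query at position m and a key at position n.

    The real part of q * conj(k) equals the ordinary 2-D dot product.
    """
    return (rope_rotate(q, m) * rope_rotate(k, n).conjugate()).real

q, k = complex(1.0, 2.0), complex(0.5, -1.0)

near = score(q, k, 5, 2)         # positions 5 and 2, offset 3
far = score(q, k, 105, 102)      # shifted by 100, same offset 3
other = score(q, k, 5, 3)        # different offset
```

Here near and far agree up to floating-point error because the two rotations combine into a single rotation by (m - n) * theta, while other differs because the offset changed.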

8. Parallel processing and why it matters

Alammar and Grootendorst explain that tokens are processed in parallel rather than sequentially, which is the key advantage over RNNs. They show how RoPE encodes relative position through rotation, enabling models to generalize to longer contexts than seen during training. This connects directly to Flash Attention and grouped-query attention as efficiency improvements on the same core mechanism.

Alammar & Grootendorst, Ch. 3. ↩ Back to article

Self-Attention

9. The chicken-and-road coreference example

Jurafsky and Martin open SLP3 Section 8.1 with a memorable coreference example: "The chicken didn't cross the road because it was too tired" versus "The chicken didn't cross the road because it was too wide." In the first, "it" corefers with the chicken; in the second, with the road. A static embedding for "it" cannot capture this distinction; only contextual attention can resolve it. They show (Figure 8.3) how the self-attention weights for "it" attend heavily to both "chicken" and "road," drawing on context to disambiguate. This is the canonical demonstration of why self-attention exists.

SLP3 §8.1, Figure 8.3. Read SLP3 ↩ Back to article

10. Interactive visualization

Georgia Tech's Transformer Explainer is an excellent interactive visualization of self-attention and the full transformer pipeline. It lets you type a prompt and watch token embeddings, attention weights, and layer outputs update in real time. Useful as a teaching reference or a quick refresher on how Q-K-V flows through the architecture.

poloclub.github.io/transformer-explainer ↩ Back to article

11. Attention as a "fuzzy dictionary"

Farris, Biderman and Raff describe attention as a "fuzzy dictionary" that uses queries, keys, and values to determine which previous tokens are most relevant when predicting the next token. Their framing makes the Q/K/V mechanism more intuitive than the standard linear algebra explanation.

Farris, Biderman & Raff, Ch. 3. ↩ Back to article

12. Why the √d_k scaling factor

Jurafsky and Martin explain the scaling factor in SLP3 Section 8.1.1 (Eq. 8.11). They note that the dot product q_i · k_j can be an arbitrarily large (positive or negative) value, and exponentiating large values can lead to numerical issues and loss of gradients during training. Dividing by √d_k keeps the variance of the dot products at approximately 1 (assuming the components of q and k have roughly unit variance), which keeps the softmax in a well-behaved range. Without this scaling, the softmax tends to push nearly all weight onto a single token, making gradients vanishingly small for the others.

SLP3 §8.1.1. Read SLP3 ↩ Back to article
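
The effect of the scaling can be seen on random vectors. This is a minimal sketch, assuming query and key components drawn from a standard normal distribution with d_k = 64; the seed and sizes are arbitrary choices for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d_k, n_keys = 64, 8
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]  # variance grows with d_k
scaled = [s / math.sqrt(d_k) for s in raw]              # variance back near 1

peak_raw = max(softmax(raw))        # largest attention weight without scaling
peak_scaled = max(softmax(scaled))  # largest attention weight with scaling
```

Dividing every score by the same positive constant greater than 1 can only flatten the softmax, so peak_scaled is never larger than peak_raw; with unscaled scores the largest weight is typically close to 1.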

13. Attention in action: LLaMA-3 heatmaps

Widdows and Cohen provide a concrete demonstration of scaled dot-product attention in action in Ch. 4 (Section 4.3.2). Using LLaMA-3 (405B), they show attention weight heatmaps for the phrase "Born in 1938, alto saxophonist Charles," where the model correctly predicts "McPherson" by attending to both year-of-birth and instrument tokens. They also show how the same name "Romeo" gets different attention patterns in "Juliet has a Romeo" (predicting "Forrest") vs. "Juliet has an Alfa Romeo" (predicting "BMW"). These examples vividly illustrate how Q-K-V attention resolves ambiguity in practice.

Widdows & Cohen, Ch. 4 §4.3.2. ↩ Back to article

Multi-Head Attention

14. Multi-head formalism and dimensionality constraint

Jurafsky and Martin formalize multi-head attention in SLP3 Section 8.1 (Eqs. 8.15-8.20). Each head c has its own weight matrices W^Qc, W^Kc, and W^Vc. The outputs of all A heads are concatenated and projected through a final matrix W^O. In the original Vaswani et al. design, d = 512, A = 8 heads, and d_k = d_v = 64. The key dimensionality constraint is d_k = d_v = d/A, meaning the total parameter cost of the Q, K, and V projections in multi-head attention equals that of a single head operating at the full model dimension. More heads do not mean more parameters; they mean more specialized, lower-dimensional subspaces.

SLP3 §8.1, Eqs. 8.15-8.20. Read SLP3 ↩ Back to article
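
The parameter arithmetic behind the dimensionality constraint can be worked through with the Vaswani et al. numbers quoted in this note. The sketch counts weight matrices only and ignores bias terms, which is an assumption made here for simplicity.

```python
d, A = 512, 8          # model dimension and number of heads
d_k = d_v = d // A     # 64 dimensions per head

# Each head has W^Q [d x d_k], W^K [d x d_k], and W^V [d x d_v].
per_head = d * d_k + d * d_k + d * d_v
multi_head_qkv = A * per_head

# A single head working at the full model dimension: three [d x d] matrices.
single_head_qkv = 3 * d * d

# The output projection W^O maps the concatenated heads [A * d_v] back to d.
w_o = (A * d_v) * d
```

With d_k = d/A, the eight narrow heads together hold exactly as many Q, K, and V weights as one full-width head (786,432), and W^O adds the same 512 x 512 matrix in either case.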

15. BERTology and head specialization

Widdows and Cohen corroborate head specialization in Ch. 5. They discuss "BERTology" research and confirm that different attention heads learn to attend to different ranges and relationships, some of which are like grammatical dependencies. They also note that Vaswani et al. found 8 attention heads, each working in 64 dimensions (projected from 512), performed well when concatenated back to a 512-dimensional output. This provides the original design rationale behind the multi-head pattern that GPT-3 later scaled to 96 heads.

Widdows & Cohen, Ch. 5. ↩ Back to article

Feed-Forward Network

16. FFN formula and per-layer specialization

Jurafsky and Martin present the FFN formula in SLP3 Section 8.2 (Eq. 8.21): FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂. They note that in the original transformer, d = 512 and d_ff = 2048 (a 4x expansion). A crucial detail: the FFN weights are the same for each token position i, but are different from layer to layer. This means the FFN applies identical transformations to every token at a given depth, but those transformations change as you move up the stack. Because each layer has its own weights, deeper layers are free to extract progressively more abstract features.

SLP3 §8.2, Eq. 8.21. Read SLP3 ↩ Back to article
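
The formula FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂ and its position-wise behavior can be sketched with toy dimensions. The weights below are hand-picked placeholders (d = 2, d_ff = 4), not values from any trained model, and the helper names are made up for this example.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    """Compute xW + b for a row vector x and a matrix W of shape [in x out]."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def ffn(x, W1, b1, W2, b2):
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Toy weights: expand from d = 2 to d_ff = 4, then contract back to 2.
W1 = [[1, -1, 0, 2],
      [0, 1, -1, 1]]
b1 = [0, 0, 0, -1]
W2 = [[1, 0],
      [0, 1],
      [1, 1],
      [0.5, 0]]
b2 = [0, 0]

# The same weights are applied to every position independently.
X = [[1.0, 2.0], [0.0, -1.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in X]
```

Because each row of X is processed on its own, no information moves between positions here; a different layer would use its own W1, b1, W2, and b2.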

17. PyTorch implementation of FFN + shortcuts

Raschka's implementation book provides the actual PyTorch code for each component discussed here. The feed-forward network expands embeddings from 768 to 3072 dimensions (4x), applies GELU activation, then contracts back. Shortcut (residual) connections add a layer's input directly to its output, creating alternate gradient paths.

Raschka, Ch. 4. ↩ Back to article

Residual Connections and Layer Normalization

18. LayerNorm derivation and prenorm vs postnorm

Jurafsky and Martin provide the full formal derivation of layer normalization in SLP3 Section 8.2 (Eqs. 8.22-8.25). They describe it as a z-score from statistics: subtract the mean, divide by the standard deviation, then apply learned gain (gamma) and offset (beta) parameters. A subtle but important point: the prenorm architecture (LayerNorm before attention and FFN) is not what Vaswani et al. originally proposed. The original paper used postnorm (LayerNorm after). Prenorm turns out to train more stably at scale, though it requires one extra LayerNorm at the very top of the stack.

SLP3 §8.2, Eqs. 8.22-8.25, footnote 2. Read SLP3 ↩ Back to article
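
The z-score description translates almost directly into code. This is a minimal sketch for a single vector; the small eps term added under the square root is a common numerical safeguard assumed here, not something taken from the SLP3 equations.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then rescale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)  # eps is an assumed stabilizer
    return [g * (v - mean) / std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * len(x)  # learned gain, shown here at a neutral value
beta = [0.0] * len(x)   # learned offset, shown here at a neutral value

y = layer_norm(x, gamma, beta)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

With the neutral gain and offset, y has a mean of about 0 and a variance of about 1; during training, gamma and beta let the model move away from that default where it helps.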

19. The vanishing gradient motivation

Widdows and Cohen explain the motivation for these mechanisms from first principles. In Ch. 3, they describe the vanishing gradient problem: with sigmoid activations, the derivative can become very small at extreme input values, causing learning to stall. In Ch. 4, they note the transformer architecture includes normalization and smoothing steps in between attention and feed-forward layers to avoid known problems like overfitting and vanishing gradients. They also connect this to LSTMs, which introduced gates specifically to mitigate vanishing gradients in RNNs, showing the residual connection as the transformer's more elegant solution to the same problem.

Widdows & Cohen, Ch. 3-4. ↩ Back to article

Output: From Vectors to Words

20. Weight tying and the unembedding layer

Jurafsky and Martin describe the full language modeling head in SLP3 Section 8.5. The key mechanism is weight tying: the same embedding matrix E of shape [|V| x d] that maps tokens to vectors at the input is transposed to E^T of shape [d x |V|] at the output, mapping vectors back to logits over the vocabulary. They call this transposed matrix the unembedding layer. Weight tying reduces parameter count and enforces a consistent representation between the input and output spaces. The logit for token k is simply the dot product of the final hidden state with the embedding for token k.

SLP3 §8.5, Eqs. 8.46-8.47. Read SLP3 ↩ Back to article
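
Weight tying means a single matrix serves both directions. The sketch below uses a made-up embedding matrix with |V| = 3 and d = 2; the function names embed and unembed are hypothetical labels for the two uses of E.

```python
# One shared matrix E of shape [|V| x d] (placeholder values).
E = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

def embed(token_id):
    """Input side: select the row of E for a token."""
    return E[token_id]

def unembed(h):
    """Output side: h times E^T, i.e. one dot product per vocabulary row."""
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in E]

h = [2.0, -1.0]      # a final hidden state of dimension d
logits = unembed(h)  # one logit per vocabulary entry
best_token = max(range(len(logits)), key=lambda k: logits[k])
```

Each logit is the dot product of the hidden state with one token's embedding, so the model scores a token highly when the final hidden state points in the direction of that token's input embedding.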

21. Quality vs diversity in sampling

Jurafsky and Martin frame the sampling tradeoff in SLP3 Section 8.6 as a tension between quality and diversity. Methods that emphasize the most probable tokens produce text rated as more accurate and coherent, but also more boring and repetitive. Methods that sample from the middle of the distribution produce more creative and diverse text, but risk incoherence. They present top-k (Section 8.6.1) as a simple generalization of greedy decoding, and top-p (Section 8.6.2, Holtzman et al. 2020) as the adaptive improvement: rather than keeping a fixed count of tokens, it keeps the smallest set of most probable tokens whose cumulative probability reaches p, so the size of the candidate set adjusts dynamically to the distribution's shape.

SLP3 §8.6.1-8.6.2. Read SLP3 ↩ Back to article
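
The difference between a fixed count and a fixed probability mass shows up clearly on two toy distributions. This is a sketch of the truncation step only, with made-up probabilities; the function names are hypothetical, and actual sampling from the renormalized result is left out.

```python
def renormalize(probs, keep):
    """Return {token index: probability} over the kept tokens, summing to 1."""
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalize(probs, keep)

flat = [0.5, 0.3, 0.15, 0.05]
peaked = [0.97, 0.01, 0.01, 0.01]

flat_k = top_k_filter(flat, 2)
flat_p = top_p_filter(flat, 0.9)      # needs three tokens to reach 0.9
peaked_p = top_p_filter(peaked, 0.9)  # one token already exceeds 0.9
```

Top-k would keep two candidates for both distributions, while top-p keeps three for the flat one and a single candidate for the peaked one.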

22. Autoregressive generation and temperature

Farris, Biderman and Raff emphasize that LLMs are autoregressive, generating one token at a time and feeding each back as input. Temperature controls the creativity-reliability tradeoff. Because each token is sampled from a distribution rather than chosen deterministically, the same prompt can produce different outputs.

Farris, Biderman & Raff, Ch. 3. ↩ Back to article
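
Temperature is usually applied by dividing the logits before the softmax. The sketch below assumes that common formulation and uses made-up logits; the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # plain softmax
hot = softmax_with_temperature(logits, 2.0)      # flatter distribution
```

Lower temperatures concentrate probability on the top token, which makes sampled output more predictable; higher temperatures spread probability across more tokens, which makes it more varied.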

The KV Cache

23. Mathematical basis for KV caching

Jurafsky and Martin explain the mathematical basis for KV caching in SLP3 Section 8.3. They show that the attention scores are computed as QK^T, an [N x N] matrix of all pairwise query-key dot products (Eq. 8.33, Figure 8.9). The causal mask sets the upper triangle to negative infinity, ensuring each token attends only to itself and earlier positions (Figure 8.10). Because adding a new token adds only one new row to Q, K, and V (equivalently, one new column to K^T), the previously computed K and V vectors remain unchanged. The cache stores exactly these prior K and V rows so they need not be recomputed. This is also why attention is O(N²) in sequence length: the QK^T matrix has N² entries.

SLP3 §8.3, Eqs. 8.32-8.34. Read SLP3 ↩ Back to article
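
The claim that cached K and V rows never change can be tested by computing causal attention two ways. This is a single-head sketch in plain Python with made-up inputs and weights; the function names full_causal and cached are hypothetical, and batching, multiple heads, and the output projection are omitted.

```python
import math

def matvec(x, W):
    """Row vector x times matrix W of shape [in x out]."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*W)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, K, V):
    """Scaled dot-product attention of one query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

def full_causal(X, Wq, Wk, Wv):
    """Project every token, then let token i see positions 0..i only."""
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    return [attend(Q[i], K[:i + 1], V[:i + 1]) for i in range(len(X))]

def cached(X, Wq, Wk, Wv):
    """Process tokens one at a time, appending one K row and one V row each."""
    K_cache, V_cache, out = [], [], []
    for x in X:
        K_cache.append(matvec(x, Wk))
        V_cache.append(matvec(x, Wv))
        out.append(attend(matvec(x, Wq), K_cache, V_cache))
    return out

X = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
Wq = [[1.0, 0.5], [0.0, 1.0]]
Wk = [[0.5, 0.0], [1.0, 1.0]]
Wv = [[1.0, 2.0], [0.0, 1.0]]

out_full = full_causal(X, Wq, Wk, Wv)
out_cached = cached(X, Wq, Wk, Wv)
```

The two outputs match, and the cached version computes only one new query, key, and value per step; the price is the memory needed to hold K_cache and V_cache, which grows with the sequence.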

24. PagedAttention and non-contiguous caches

Widdows and Cohen discuss the KV cache explicitly in Ch. 5 (Section 5.3.5), and add an important detail: PagedAttention (Kwon et al.) allows the KV cache to be non-contiguous in memory, meaning multiple streams of text can be generated simultaneously without reserving space for them as they grow. They also note that a single long instruction can be processed once and then reused with different inputs, which is an optimization only for inference, not training. This connects directly to the point about GPU memory as the bottleneck.

Widdows & Cohen, Ch. 5 §5.3.5. ↩ Back to article

Encoder-Decoder vs. Decoder-Only

25. Why "decoder-only" is a confusing name

Jurafsky and Martin add a helpful terminological note in SLP3 Section 8.5. They explain that the term "decoder-only model" is confusing because the transformer was originally introduced as an encoder-decoder architecture, and it was only later that the standard paradigm for causal language models was defined by using only the decoder part of this original architecture. The "decoder" label is a historical artifact of the translation-era framing. What we now call a "decoder-only" model is really just a causal (left-to-right) transformer with a language modeling head.

SLP3 §8.5. Read SLP3 ↩ Back to article

26. Encoder-only, decoder-only, and the training paradigm

Alammar and Grootendorst draw a clear line between encoder-only models (BERT, used for classification and NER) and decoder-only models (GPT, used for generation). They frame the distinction through the training paradigm: pretraining on large unlabeled data followed by fine-tuning for specific tasks. This two-stage pattern explains why the same base architecture powers both chatbots and search systems.

Alammar & Grootendorst, Ch. 1. ↩ Back to article
