Sources

Grounding, citations, and further reading for The Art of Chunking.

These are the sources behind the article. Nothing on this page is required reading, and you do not need to purchase any of these books.

The article itself is self-contained. This page exists so that the work is properly cited and so that anyone who wants to go deeper knows where to look.

References

1. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems, 33.

2. Kamradt, G. (2023). "5 Levels of Text Splitting." Full Stack Retrieval.

3. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv preprint arXiv:2307.03172.

4. LangChain. (2024). "Text Splitters." LangChain Documentation.

A Brief History of Text Segmentation

5. Grounding note

SLP3 §11.1.1 formalizes Salton's intuition as the term-document matrix, where each column represents a document as a count vector over the vocabulary. The entire IR framework assumes a document is the atomic unit of retrieval, a column in the matrix. What the article calls "chunking" is, in formal terms, the decision about what constitutes a column. Jurafsky and Martin illustrate this with Shakespeare plays: each play is one column vector. But a play could just as easily be split into acts or scenes, and each split would produce different term-document matrices with different retrieval properties.
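To make the "what counts as a column" point concrete, here is a minimal sketch that builds count vectors for the same text at two granularities. The two "scenes" are hypothetical toy strings, and the whitespace tokenization is a simplifying assumption rather than anything SLP3 prescribes.

```python
from collections import Counter

def term_document_matrix(units):
    """Build one count vector (column) per retrieval unit over a shared vocabulary."""
    counts = [Counter(text.lower().split()) for text in units]
    vocab = sorted(set().union(*counts))
    # Each column is one retrieval unit; each position is a vocabulary term.
    columns = [[c[term] for term in vocab] for c in counts]
    return vocab, columns

# Hypothetical toy "play" made of two "scenes".
scene_1 = "the fool jests and the fool sings"
scene_2 = "the battle rages and soldiers fall"

# Same text, two choices of retrieval unit.
vocab_play, play_cols = term_document_matrix([scene_1 + " " + scene_2])
vocab_scene, scene_cols = term_document_matrix([scene_1, scene_2])

print(len(play_cols), "column for the whole play")
print(len(scene_cols), "columns when split by scene")
```

The vocabulary is identical in both cases, but the per-scene columns keep "fool" and "battle" apart, while the whole-play column mixes them.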

6. Grounding note

Widdows and Cohen provide useful context here. Their Ch. 2 traces vector-based search engines back to the 1960s, when information retrieval researchers moved beyond Boolean logic to graded relevance rankings using vector models. They note that "keyword-based information retrieval began with automating the work of librarians in the 1950s, and was making billions of dollars by the early 2000s." The chunking problem is a direct descendant of this earlier question of what constitutes the right retrieval unit. Widdows & Cohen, Issue #45

7. Grounding note

Widdows and Cohen describe the TREC evaluation methodology in detail in Ch. 2, Section 2.3.3. They trace it back to the Cranfield experiments of the early 1960s, which established precision and recall as the standard measures for retrieval quality. TREC-style shared tasks made it "harder to prefer complicated language processing systems over simple ones" by requiring demonstrated results rather than theoretical arguments. This mirrors the article's later point about empirically evaluating chunking strategies. Widdows & Cohen, Issue #45

8. Grounding note

SLP3 §11.3 explains the core motivation for the shift from sparse to dense retrieval: the vocabulary mismatch problem (Furnas et al., 1987). With tf-idf or BM25, the user posing a query "needs to guess exactly what words the writer of the answer might have used." Dense embeddings address this by encoding semantics rather than exact terms. But the flip side, which the article highlights, is that a dense model compresses each retrieval unit into a single vector. This creates a new problem: the unit must be small enough that one vector can faithfully represent it.

9. Grounding note

Widdows and Cohen note in Ch. 2, Section 2.3.1 that the BM25 weighting scheme "adjusts for document length," and parenthetically observe that "Deciding whether documents should be indexed at the book, chapter, paragraph, or verse level is a design decision that also affects document lengths." This is essentially the chunking question stated in information retrieval terms: the granularity of the indexing unit is a fundamental design choice that predates RAG by decades. Widdows & Cohen, Issue #45

10. Grounding note

Farris et al. note that context size determines how much information an LLM can process at once, but larger contexts don't guarantee better understanding. Chunking strategy directly determines what fits in that context window and how well the model can use it. See GH #3, Ch. 5.

Why Chunking Matters

11. Grounding note

SLP3 §11.1.3 formalizes this "nearest-neighbor search" as cosine similarity: score(q,d) = cos(q,d) = (q . d) / (|q| |d|). The cosine is the normalized dot product of the query vector and document vector. For dense retrieval, SLP3 §11.3 notes that modern systems use approximate nearest neighbor search algorithms like FAISS (Johnson et al., 2017) because "finding the set of dense document vectors that have the highest dot product with a dense query vector is an instance of the problem of nearest neighbor search." The quality of each document vector, which is exactly what chunking determines, is the foundation on which this entire search mechanism rests.
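The cosine formula above can be sketched in a few lines of plain Python. The `nearest` helper does exact brute-force search over hypothetical toy vectors; it stands in for what an approximate nearest neighbor library would do at scale and is not FAISS's API.

```python
import math

def cosine(q, d):
    """Normalized dot product: (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_q * norm_d)

def nearest(query, chunk_vectors):
    """Exact nearest neighbor by cosine: return (index, score) of the best chunk."""
    scores = [cosine(query, v) for v in chunk_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical 2-dimensional "embeddings" for three chunks.
chunk_vectors = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
query = [1.0, 0.0]
print(nearest(query, chunk_vectors))
```

Note that the second chunk wins even though it is twice as long as the query vector: cosine ignores magnitude and compares direction only.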

12. Grounding note

Widdows and Cohen provide the mathematical foundation for this in Ch. 2, Section 2.1. They explain that cosine similarity acts as the metric for comparing vectors, where "higher scores mean vectors are closer to one another." They also note that in higher dimensions, randomly chosen pairs of vectors tend to have cosine similarity closer to zero, which means that meaningful high cosine scores become rarer and more significant. This is why precise, focused chunk embeddings matter: they produce stronger, more distinctive similarity signals. Widdows & Cohen, Issue #45

13. Grounding note

SLP3 §11.1.2 shows that even classical sparse retrieval recognized the document-length problem. BM25 includes a parameter b that explicitly controls "the importance of document length normalization." When b = 1, BM25 scales fully by length; when b = 0, it ignores length entirely. The recommended default is b = 0.75. This is the sparse retrieval community's engineering answer to the same problem that motivates chunking in dense retrieval: longer documents accumulate more term matches simply by being longer, not by being more relevant. Chunking addresses in dense retrieval the same length problem that tuning b addresses in BM25.
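The effect of b is easy to see numerically. The sketch below computes only the term-frequency part of the BM25 weight, tf / (k(1 - b + b|d|/|d_avg|) + tf), and omits the idf factor. The value k = 1.2 and the document lengths are assumptions chosen for illustration.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    """Term-frequency component of BM25 (idf factor omitted in this sketch)."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf / (k * length_norm + tf)

# Same raw term count in a short and a long document (hypothetical lengths).
tf, avg_len = 3, 300
short_b0 = bm25_tf_weight(tf, 100, avg_len, b=0.0)
long_b0 = bm25_tf_weight(tf, 1000, avg_len, b=0.0)
short_b75 = bm25_tf_weight(tf, 100, avg_len, b=0.75)
long_b75 = bm25_tf_weight(tf, 1000, avg_len, b=0.75)

print(f"b=0.00: short={short_b0:.3f} long={long_b0:.3f}")
print(f"b=0.75: short={short_b75:.3f} long={long_b75:.3f}")
```

With b = 0 the two documents score identically. With b = 0.75 the same three matches count for more in the short document than in the long one.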

14. Grounding note

Widdows and Cohen illustrate this exact problem vividly in Ch. 3. They describe how a RAG system queried about the Alfa Romeo car brand could be led astray by documents about Shakespeare's Romeo and Juliet, because "a single global embedding vector is used to represent all of the contextual meanings of a word." This is the average meaning problem at the word level, and it compounds at the chunk level: a chunk mixing automotive and literary content produces an embedding that is a poor match for either topic. Widdows & Cohen, Issue #45

15. Grounding note

Alammar & Grootendorst emphasize that representing a long document as a single vector loses information, and that chunking into multiple vectors enables more granular nearest-neighbor search. They recommend chunking with overlap to avoid splitting concepts at boundaries. See GH #5, Ch. 8.

Fixed-Size Chunking: The Baseline

16. Grounding note

SLP3 §2.1 introduces Heaps' Law (|V| = kN^beta, where beta is typically 0.44-0.56), which describes how vocabulary size grows sub-linearly with corpus size. This has a subtle implication for chunking: larger chunks contain more unique word types, so their tf-idf and BM25 vectors are denser and more discriminative. Smaller chunks, by contrast, have sparser term distributions. At 256 tokens, many technical terms may appear only once per chunk, making tf-idf scores noisy. At 1024 tokens, the same terms appear with more reliable frequencies. This is one reason why hybrid retrieval (BM25 + dense) can partially compensate for aggressive chunking: BM25 benefits from slightly larger chunks while dense retrieval benefits from smaller, focused ones.

17. Grounding note

SLP3 §2.4 provides the formal definition of the BPE algorithm that underlies the tiktoken encoder used in the code above. The algorithm has two components: a trainer that iteratively merges the most frequent adjacent byte pairs in a corpus to build a vocabulary, and an encoder that greedily applies the learned merges to new text. The key insight for chunking is that BPE tokens are neither characters nor whitespace-delimited words. They are data-driven subword units whose boundaries depend on the training corpus. This means "512 tokens" represents a different amount of text depending on which tokenizer you use, and the code's choice of cl100k_base (the GPT-4 tokenizer) is itself a consequential decision.
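The trainer/encoder split described above can be illustrated with a toy BPE. This is a simplified sketch that works on characters of whole words rather than bytes, and it is not how tiktoken is implemented; the corpus and the number of merges are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(words, num_merges):
    """Trainer: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges

def encode(word, merges):
    """Encoder: apply the learned merges to new text, in the order learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical toy corpus in which "low" is frequent.
merges = train_bpe(["low"] * 5 + ["lower"] * 2 + ["newest"] * 3, num_merges=2)
print(encode("low", merges))     # ['low']
print(encode("lowest", merges))  # ['low', 'e', 's', 't']
```

The word "lowest" never appears in the training corpus, yet part of it is covered by a learned subword. That is why a token count depends on the tokenizer's training data and not only on the text being chunked.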

Overlap: Bridging the Boundaries

18. Grounding note

Raschka's sliding window approach to creating training pairs mirrors chunking strategy decisions: stride controls overlap between windows, and context length sets the maximum window size. The same tradeoff between coverage and redundancy applies to both training data preparation and RAG chunking. See GH #4, Ch. 2.
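A minimal sketch of the stride and window-size relationship, using a hypothetical list of integer token ids. The function name and parameters are illustrative and not taken from Raschka's code.

```python
def sliding_windows(tokens, window_size, stride):
    """Cut `tokens` into windows of at most `window_size`, starting every `stride` tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # The last window already reaches the end of the sequence.
    return windows

tokens = list(range(10))  # Hypothetical token ids 0..9.

# stride < window_size: consecutive windows share window_size - stride tokens.
overlapping = sliding_windows(tokens, window_size=4, stride=2)

# stride == window_size: no overlap, no redundancy.
disjoint = sliding_windows(tokens, window_size=4, stride=4)

print(overlapping)
print(disjoint)
```

Halving the stride here raises the window count from 3 to 4 on this short input. On long documents the count scales roughly with 1/stride, which is the redundancy cost of overlap.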

Recursive Chunking: Respecting Structure

19. Grounding note

SLP3 §2.6 notes that regular expressions play a critical role in the pretokenization step of modern NLP pipelines, "breaking the input at spaces and punctuation, stripping off clitics, and breaking numbers into sets of digits." The separator hierarchy in recursive chunking is essentially a regex-based pretokenization applied at the document level rather than the sentence level. The same principles apply: you need patterns that match structural boundaries (section breaks, paragraph breaks) before falling through to finer-grained ones (sentence boundaries, word boundaries). LangChain's separator list is a chunking-specific regex hierarchy.
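The fall-through behavior of a separator hierarchy can be sketched as a short recursive function. This is a simplified illustration and not LangChain's implementation: it uses plain string separators rather than regexes, drops the separators from the output, and does not merge small neighboring pieces back together. The separator list and `max_chars` value are assumptions.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    coarsest, finer = separators[0], separators[1:]
    chunks = []
    for piece in text.split(coarsest):
        chunks.extend(recursive_split(piece, max_chars, finer))
    return chunks

# Hypothetical two-paragraph document.
doc = "Intro paragraph.\n\nSecond paragraph is a bit longer. It has two sentences."
for chunk in recursive_split(doc, max_chars=40):
    print(repr(chunk))
```

The first paragraph fits and is kept whole. The second is too long, so it falls through to the sentence separator and is split there instead of mid-word.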

Semantic Chunking: Following the Meaning

20. Grounding note

The cosine similarity measure used in step 3 here is formally developed in Widdows and Cohen, Ch. 2, Section 2.1. They show that cosine is the scalar product divided by both vector lengths, and that it generalizes to any number of dimensions. Interestingly, LSA variants described in Ch. 2, Section 2.4 used "sliding-window methods, where a word's distribution is based on which other words appear nearby" -- an earlier form of measuring local semantic coherence that anticipates the consecutive-sentence comparison used in semantic chunking. Widdows & Cohen, Issue #45

21. Grounding note

SLP3 §11.3 describes the bi-encoder architecture that underpins the sentence-transformer model used in this code. In a bi-encoder, separate encoder models produce embeddings for queries and documents independently; the relevance score is simply the dot product of these vectors. "We encode each document, and store all the encoded document vectors in advance." This architecture is what makes the sentence-by-sentence embedding step in semantic chunking feasible: you encode each sentence once, store the vectors, and then compute pairwise cosine similarities cheaply. A cross-encoder, by contrast, would require passing every sentence pair through the model jointly, which would make semantic chunking prohibitively expensive.
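Once each sentence has been encoded independently, the boundary decision reduces to comparing adjacent vectors. The sketch below assumes the vectors are already computed; the toy 2-dimensional vectors and the 0.5 threshold are hypothetical stand-ins for real sentence-transformer output and a tuned cutoff.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, vectors, threshold=0.5):
    """Start a new chunk wherever consecutive sentence vectors fall below `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # Similarity dropped: treat this as a topic boundary.
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]

sentences = ["Cats purr.", "Cats nap a lot.", "Engines need oil."]
# Hypothetical precomputed embeddings, one per sentence, encoded once and stored.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]

print(semantic_chunks(sentences, vectors))
```

Each sentence is encoded once and only n - 1 cheap vector comparisons are needed, which is the cost advantage of the bi-encoder setup over joint scoring of every sentence pair.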

The Chunk Size and Retrieval Quality Curve

22. Grounding note

Alammar & Grootendorst frame dense retrieval as turning search into a nearest-neighbor problem in embedding space. This means chunk size directly determines what each "neighbor" represents: too large and the nearest neighbor is semantically vague, too small and it lacks context. See GH #5, Ch. 8.

Common Pitfalls

23. Grounding note

SLP3 §2.4.3 drives this point home with a striking example. A sentence that tokenizes into 18 tokens in English requires 33 tokens in Spanish, because the BPE vocabulary is biased toward its English training data. Words like hondo ("deep") get split into h and ondo. For multilingual RAG corpora, this means character-based chunk sizes produce wildly inconsistent token counts across languages. A 1000-character chunk in English might be 250 tokens, while the same character count in Turkish or Vietnamese could be 400+ tokens, potentially exceeding the embedding model's context window.

24. Grounding note

Widdows and Cohen discuss the subtleties of tokenization in Ch. 2, Section 2.3.2. They explain that modern LLMs replaced handcrafted tokenization rules with automated approaches like Byte-Pair Encoding, which "replaces the most frequent byte-pair with a single byte, and keeps going." Since BPE tokens are neither characters nor whitespace-delimited words, the mismatch between character-based chunking and token-based model limits is even more treacherous than it appears. The same text may tokenize very differently across models (e.g., SentencePiece vs. tiktoken). Widdows & Cohen, Issue #45

Evaluating Your Chunking Strategy

25. Grounding note

SLP3 §11.2 formalizes the MRR metric used in the code above and introduces a complementary metric, mean average precision (MAP). MAP descends the ranked list noting precision only at ranks where a relevant item appears, then averages across queries: MAP = (1/|Q|) * sum over q in Q of AP(q). For chunking evaluation, MAP is often more informative than recall@k because it rewards systems that rank relevant chunks higher, not just systems that include them somewhere in the top k. If two chunking strategies both achieve 80% recall@5 but one consistently ranks the best chunk at position 1 while the other buries it at position 4, MAP will distinguish them.
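The position-1 versus position-4 comparison can be worked through directly. The sketch below computes reciprocal rank and average precision for a single hypothetical query under two chunking strategies; the chunk ids are invented, and average precision is taken over the relevant items that were actually retrieved.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0 if none was retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank, taken only at ranks where a relevant item appears."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Hypothetical top-5 results for one query; "c1" is the only relevant chunk.
strategy_a = [(["c1", "x", "y", "z", "w"], {"c1"})]  # relevant chunk at rank 1
strategy_b = [(["x", "y", "z", "c1", "w"], {"c1"})]  # relevant chunk at rank 4

print("MAP A:", mean_over_queries(average_precision, strategy_a))
print("MAP B:", mean_over_queries(average_precision, strategy_b))
print("MRR A:", mean_over_queries(reciprocal_rank, strategy_a))
print("MRR B:", mean_over_queries(reciprocal_rank, strategy_b))
```

Both strategies retrieve the relevant chunk within the top 5, so recall@5 is identical, but MAP and MRR are 1.0 for the first strategy and 0.25 for the second.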

26. Grounding note

Widdows and Cohen provide helpful historical context for this evaluation approach. In Ch. 2, Section 2.3.3, they describe how the Cranfield experiments of the early 1960s established precision and recall as standard retrieval metrics, and how the TREC conferences scaled this methodology into a shared-task paradigm. The recall@k and MRR metrics used in the code above are direct descendants of these Cranfield/TREC measures. The Cranfield team's "uncomfortable conclusion" that simple single-term indexing outperformed expert human indexers reinforces the article's point that empirical measurement often overturns intuition. Widdows & Cohen, Issue #45

Putting It Together: A Decision Framework

27. Grounding note

SLP3 §11.4 formalizes the RAG pipeline in three steps: (1) call a retriever to return R(q) = d1 ... dk, the top-k relevant passages, (2) create a prompt including q and the retrieved passages, and (3) call an LLM on that prompt. The generation step models p(x1,...,xn) as the product over i of p(xi | R(q); "Answer the following question..."; q; x<i). Notice that the retrieved passages R(q) directly condition every token the model generates. Jurafsky and Martin note that "there may be noise in the retrieved passages; some of them may be irrelevant or wrong." Incoherent chunks increase this noise. The generator has no mechanism to repair bad retrieval; it can only work with what it receives.
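The retrieve, prompt, generate sequence can be sketched end to end with stand-ins. Everything here is hypothetical: the retriever ranks by simple word overlap instead of a real index, the prompt wording is invented, and `llm` is any callable that maps a prompt string to text.

```python
def retrieve(query, passages, k=2):
    """Toy retriever: rank passages by word overlap with the query, return the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, retrieved):
    """Place the retrieved passages R(q) ahead of the question, so they condition the answer."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(retrieved, start=1))
    return (
        f"{context}\n\n"
        "Answer the following question using the passages above.\n"
        f"Question: {query}\nAnswer:"
    )

def answer(query, passages, llm, k=2):
    """Retrieve, build the prompt, then hand it to the (hypothetical) LLM callable."""
    return llm(build_prompt(query, retrieve(query, passages, k)))

passages = [
    "bm25 can normalize for document length",
    "romeo and juliet is a play",
    "dense retrieval encodes semantics",
]
# Echo "model" used only to show exactly what the generator would receive.
print(answer("what does bm25 normalize", passages, llm=lambda prompt: prompt, k=1))
```

Whatever `retrieve` returns is all the generator sees, so an off-topic or incoherent chunk in the top k goes straight into the prompt.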

28. Grounding note

Widdows and Cohen discuss the Lewis et al. RAG paper in Ch. 5, Section 5.3.3. They frame RAG as "a computational compromise: it's expensive to train a whole new language model for every domain, but relatively cheap to build a search engine, so we put these together into a hybrid system." They also offer a useful caution: RAG is "easily misinterpreted" because while domain-specific search results help produce more factual answers, they don't constrain the model "to produce only sentences that are equally authoritative." This reinforces why chunk quality matters so much -- poor retrieval doesn't just miss information, it can actively mislead the generation step. Widdows & Cohen, Issue #45

29. Grounding note

A useful tangent from Widdows and Cohen, Ch. 6: they discuss how LLMs produce "hallucinations" -- text that is fluent but not factually accurate. They note that the term confabulation may be more apt than hallucination, and that "plausibility in and of itself can be persuasive." This connects to the article's premise: when retrieval feeds the LLM incoherent or off-topic chunks, the model doesn't fail silently -- it confabulates confidently around whatever fragments it receives. Good chunking is a first line of defense against this failure mode. Widdows & Cohen, Issue #45