
PRE-MERGER SNAPSHOT This article has been subsumed into Vector RAG: Inside the Dense-Vector Retrieval Stack, which combines this piece with two siblings (vector-database internals and chunking strategies) and adds a Part 4 on how the three layers cascade. The original remains here as a pre-merger snapshot. For the canonical Week 5 reading, follow the link above.

The Art of Chunking

The most consequential decision in a RAG pipeline is the one most teams spend the least time on: how to split documents into pieces.

When teams build retrieval-augmented generation systems, they tend to obsess over the choice of embedding model, the vector database, the reranking strategy. These are important decisions. But the upstream choice, the one that constrains everything downstream, is far more mundane. It is the question of how you break a document into chunks before embedding it.

Get chunking wrong and no amount of engineering elsewhere will save your retrieval quality. Get it right and even a simple pipeline can produce surprisingly good results. This article walks through the major chunking strategies, their tradeoffs, and the practical considerations that should guide your choices.

A Brief History of Text Segmentation

The problem of splitting text into meaningful units is far older than RAG. Information retrieval researchers have wrestled with segmentation since the 1960s and 1970s, when Salton's SMART system had to decide what constituted a "document" for indexing purposes. In early search engines, the document was typically the unit of retrieval: a whole web page, a whole email, a whole file.

The shift toward sub-document retrieval came gradually. Passage retrieval, where systems return specific paragraphs rather than whole documents, gained traction in the TREC competitions of the late 1990s and early 2000s. Researchers discovered that returning a focused passage often produced better answers than returning an entire relevant document, because the user did not have to hunt for the specific information they needed.

The rise of dense retrieval models in the late 2010s made the segmentation question urgent again. Unlike sparse keyword methods (TF-IDF, BM25), which can score individual terms anywhere in a document, dense models compress the entire input into a single vector. The quality of that vector depends critically on what you feed the model: feed it too much and the representation blurs, feed it too little and it lacks context. The chunking problem, in its modern form, was born.

Why Chunking Matters

Embedding models transform text into dense vectors, typically of 768 or 1536 dimensions. These vectors represent the "meaning" of the input text as a point in high-dimensional space. When you pass a user's query through the same embedding model, you get another vector, and retrieval becomes a nearest-neighbor search.

The problem is that embedding models have context windows, usually between 512 and 8192 tokens depending on the model. You cannot embed an entire 50-page document in a single pass. Even models with larger context windows suffer from a more fundamental issue: as the input text grows longer, the resulting embedding becomes an average of too many ideas.

Consider a research paper that discusses methodology in section 3, results in section 4, and limitations in section 5. If you embed the entire paper as one vector, that vector represents a blurred composite of methodology, results, and limitations simultaneously. When a user asks "What were the limitations of this study?", the embedding of the full paper is a mediocre match because the limitations signal is diluted by everything else.

This is the average meaning problem. A single embedding cannot faithfully represent multiple distinct topics. The solution is to split documents into smaller pieces, each focused enough that its embedding captures a coherent idea. The question is how.
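The dilution described above can be seen in a toy calculation. This is a minimal sketch, not a real embedding model: the three-dimensional vectors are hand-made (one axis per topic), the whole-paper vector is simply assumed to be the mean of its sections, and the helper `cosine` is introduced here for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": one axis per topic (methodology, results, limitations).
methodology = [1.0, 0.0, 0.0]
results = [0.0, 1.0, 0.0]
limitations = [0.0, 0.0, 1.0]

# Assumption: the whole-paper vector behaves like the mean of its sections.
whole_paper = [sum(vals) / 3 for vals in zip(methodology, results, limitations)]

# A query about limitations points along the limitations axis.
query = [0.0, 0.0, 1.0]

focused_score = cosine(query, limitations)  # match against a focused chunk
diluted_score = cosine(query, whole_paper)  # match against the blended vector

print(f"limitations chunk: {focused_score:.3f}")  # 1.000
print(f"whole paper:       {diluted_score:.3f}")  # 0.577
```

Real models do not literally average their inputs, but the direction of the effect is the same: the more topics share one vector, the weaker the match for any single one of them.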

Fixed-Size Chunking: The Baseline

The simplest approach is to split text into chunks of a fixed token count. This is where most teams start, and for good reason: it is easy to implement, easy to reason about, and works better than you might expect.

The choice of chunk size matters more than it might appear. Let's look at what different sizes actually contain, using a passage from a hypothetical textbook on machine learning:

256 tokens (~1 paragraph)

"Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent. In machine learning, we use it to find the parameters that minimize our loss function. The algorithm computes the gradient of the loss with respect to each parameter, then updates the parameters by subtracting a fraction of the gradient. This fraction is called the learning rate."

At 256 tokens, you capture roughly one focused idea. The embedding will be specific, which is excellent for precision. If someone asks "What is gradient descent?", this chunk is a strong match. But if they ask "How does gradient descent relate to backpropagation?", this chunk alone cannot answer, because that connection is discussed two paragraphs later in the source text.

512 tokens (~2-3 paragraphs)

"Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent. In machine learning, we use it to find the parameters that minimize our loss function. The algorithm computes the gradient of the loss with respect to each parameter, then updates the parameters by subtracting a fraction of the gradient. This fraction is called the learning rate. The choice of learning rate is critical. Too large, and the algorithm will overshoot the minimum, oscillating or diverging entirely. Too small, and convergence becomes impractically slow. Modern practice uses adaptive learning rate methods like Adam, which adjust the rate per parameter based on the history of gradients."

At 512 tokens, you capture a topic with some of its immediate context. The embedding represents a broader concept while remaining reasonably focused. This is the most common default for general-purpose RAG systems, and it represents a sensible middle ground.

1024 tokens (~5-6 paragraphs)

At 1024 tokens, you capture an entire subsection of a document. The embedding now represents a broader theme. This works well when queries are high-level ("Explain the optimization process in neural networks") but poorly when queries are specific ("What is the default learning rate in Adam?").

Here is a simple fixed-size chunker in Python:

import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 512, encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into fixed-size token chunks."""
    encoder = tiktoken.get_encoding(encoding_name)
    tokens = encoder.encode(text)

    chunks = []
    for i in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[i : i + chunk_size]
        chunks.append(encoder.decode(chunk_tokens))

    return chunks

# Example usage
text = open("document.txt").read()
chunks = fixed_size_chunks(text, chunk_size=512)
print(f"Created {len(chunks)} chunks")

Fixed-size chunking has one glaring flaw: it is completely ignorant of document structure. A chunk boundary might land in the middle of a sentence, in the middle of a code block, or right between a heading and the paragraph it introduces. This is where overlap comes in.

Overlap: Bridging the Boundaries

When you split a document into non-overlapping chunks, information at the boundaries gets severed. A sentence that begins at the end of chunk 7 and finishes at the start of chunk 8 will be incomplete in both chunks. Neither embedding will faithfully represent what that sentence means.

The standard solution is to overlap chunks. Instead of each chunk starting where the previous one ended, you slide the window forward by less than the full chunk size. With a chunk size of 512 tokens and an overlap of 50 tokens, chunk 1 covers tokens 0-511, chunk 2 covers tokens 462-973, chunk 3 covers tokens 924-1435, and so on.
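The window arithmetic above can be checked with a few lines of code. This is a small sketch; the helper name `window_spans` is an assumption introduced here, and it works on token counts only, without a tokenizer.

```python
def window_spans(n_tokens, chunk_size=512, overlap=50):
    """Return inclusive (first, last) token indices for each sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    spans = []
    start = 0
    while start < n_tokens:
        last = min(start + chunk_size, n_tokens) - 1
        spans.append((start, last))
        if last == n_tokens - 1:
            break  # this window already reaches the end of the text
        start += stride
    return spans

print(window_spans(1436))
# [(0, 511), (462, 973), (924, 1435)]
```

Each window starts 462 tokens after the previous one (512 minus 50), so consecutive windows share exactly 50 tokens.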

Typical overlap ratios fall between 10% and 20% of the chunk size. For 512-token chunks, that means 50 to 100 tokens of overlap. The tradeoffs are straightforward:

More overlap means better boundary coverage but more total chunks, more storage, more embedding cost, and more redundancy in retrieval results.
Less overlap means fewer chunks and lower cost but a higher chance of losing information at boundaries.
No overlap is rarely the right choice unless your chunks already align with natural document boundaries (like section headers).

Here is the fixed-size chunker extended with overlap:

def fixed_size_chunks_with_overlap(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50,
    encoding_name: str = "cl100k_base"
) -> list[str]:
    """Split text into fixed-size token chunks with overlap."""
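    if not 0 <= overlap < chunk_size:
        # With overlap >= chunk_size the window would stall or move backwards
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    encoder = tiktoken.get_encoding(encoding_name)
    tokens = encoder.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoder.decode(chunk_tokens))
        if end >= len(tokens):
            break  # last window reached the end; skip a redundant tail chunk
        start += chunk_size - overlap  # slide window

    return chunks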

# 512 tokens with ~10% overlap
chunks = fixed_size_chunks_with_overlap(text, chunk_size=512, overlap=50)
print(f"Created {len(chunks)} chunks with overlap")

A common mistake is treating overlap as a guaranteed fix for boundary problems. It helps, but it does not eliminate the issue. If a key concept spans 200 tokens and happens to straddle a boundary, a 50-token overlap will not capture it whole. Overlap is a probabilistic mitigation, not a solution.
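To put a rough number on "probabilistic", the sketch below estimates how often a span of a given length fits entirely inside at least one window. It assumes the span is equally likely to start at any token offset and ignores the end of the document; the function name `fraction_kept_whole` is introduced here for illustration.

```python
def fraction_kept_whole(span_len, chunk_size=512, overlap=50):
    """Fraction of start offsets at which a span fits inside some window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if span_len > chunk_size:
        return 0.0  # the span is longer than any single chunk
    stride = chunk_size - overlap
    # Windows repeat every `stride` tokens. Within one stride, a span that
    # starts at offset 0 .. chunk_size - span_len still ends inside the
    # window that begins at offset 0.
    fitting_offsets = chunk_size - span_len + 1
    return min(1.0, fitting_offsets / stride)

for overlap in (0, 50, 199):
    kept = fraction_kept_whole(200, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap:>3}: {kept:.2f}")
# overlap=  0: 0.61
# overlap= 50: 0.68
# overlap=199: 1.00
```

Under these assumptions, a 50-token overlap raises the chance of keeping a 200-token concept intact only from about 61% to about 68%. The guarantee arrives only when the overlap is nearly as long as the span itself.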

Recursive Chunking: Respecting Structure

Documents are not flat streams of tokens. They have structure: titles, sections, subsections, paragraphs, sentences. A chunking strategy that respects this structure will produce more coherent chunks than one that ignores it.

Recursive chunking, popularized by LangChain's RecursiveCharacterTextSplitter, works by attempting to split text using a hierarchy of separators. It first tries to split on the largest structural boundary (double newlines, which typically mark paragraph or section breaks). If the resulting pieces are still too large, it splits those pieces on the next separator (single newlines, which mark line breaks), then on spaces (word boundaries), and as a last resort on individual characters. Sentence boundaries are not part of the default hierarchy, but you can add a separator such as ". " to the list yourself.

The separator hierarchy for a typical document looks like this:

# LangChain's default separator hierarchy
separators = [
    "\n\n",   # Double newline (section/paragraph breaks)
    "\n",     # Single newline
    " ",      # Space (word boundaries)
    "",       # Character-level (last resort)
]

For Markdown documents, you can use a richer hierarchy:

# Markdown-aware separators
markdown_separators = [
    "\n# ",     # H1 headers
    "\n## ",    # H2 headers
    "\n### ",   # H3 headers
    "\n\n",     # Paragraph breaks
    "\n",       # Line breaks
    ". ",       # Sentence boundaries
    " ",        # Word boundaries
]

Here is a practical implementation using LangChain:

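# Note: newer LangChain releases ship the splitters in the separate
# langchain_text_splitters package; adjust the import to your installed version.
from langchain.text_splitter import RecursiveCharacterTextSplitter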

# General-purpose recursive splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # characters, not tokens
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

text = open("document.txt").read()
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks[:3]):
    print(f"--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk[:200])
    print()

The advantage of recursive chunking is that most chunks will align with natural boundaries in the text. A paragraph will not be split across two chunks unless it exceeds the maximum chunk size on its own. This produces embeddings that represent coherent thoughts rather than arbitrary slices.
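To make the mechanism concrete, here is a deliberately simplified recursive splitter in plain Python. It is a sketch of the idea only, not LangChain's implementation: it has no overlap, it drops the separator at chunk boundaries, and the function name `recursive_split` is an assumption introduced here.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing only where needed."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Last resort: hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > max_chars:
            # Too big for one chunk: flush what we have, then recurse
            # into this part with the finer-grained separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, max_chars, finer))
        elif current and len(current) + len(sep) + len(part) <= max_chars:
            # Greedily merge small neighbours back together.
            current = current + sep + part
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks

sample = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 50
for chunk in recursive_split(sample, max_chars=60):
    print(len(chunk), repr(chunk[:30]))
```

The two short paragraphs end up together in one chunk because they fit under the limit, while the long run of words is broken only at spaces, never mid-word.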

LangChain's RecursiveCharacterTextSplitter has become the de facto standard for this reason. It is not the most sophisticated approach, but it is reliable, well-tested, and handles the common cases well. Most production RAG systems use some variant of this strategy.

Semantic Chunking: Following the Meaning

Fixed-size and recursive chunking both operate on surface-level features of text: token counts, newlines, punctuation. They do not understand what the text is about. Semantic chunking takes a different approach entirely: it uses the content's meaning to decide where to split.

The core idea, articulated clearly by Greg Kamradt in 2023, is to measure the semantic similarity between consecutive sentences. When adjacent sentences are about the same topic, their embeddings will be similar. When the topic shifts, the similarity drops. These drops are natural split points.

The algorithm works as follows:

Split the document into sentences.
Embed each sentence (or small groups of sentences for stability).
Compute the cosine similarity between each consecutive pair of sentence embeddings.
Identify points where the similarity drops below a threshold, or where the drop is significantly larger than average.
Split the document at those points.

Here is a simplified implementation:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunk(
    text: str,
    threshold_percentile: int = 25,
    model_name: str = "all-MiniLM-L6-v2"
) -> list[str]:
    """Split text into semantic chunks based on topic shifts."""
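    # Step 1: Split into sentences
    # (naive: splitting on "." also breaks on decimals and abbreviations;
    # use a proper sentence tokenizer for real documents)
    sentences = [s.strip() for s in text.split(".") if s.strip()]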

    if len(sentences) < 3:
        return [text]

    # Step 2: Embed each sentence
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)

    # Step 3: Compute similarities between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = cosine_similarity(
            [embeddings[i]], [embeddings[i + 1]]
        )[0][0]
        similarities.append(sim)

    # Step 4: Find split points where similarity drops
    threshold = np.percentile(similarities, threshold_percentile)
    split_indices = [
        i + 1 for i, sim in enumerate(similarities)
        if sim < threshold
    ]

    # Step 5: Build chunks
    chunks = []
    start = 0
    for idx in split_indices:
        chunk = ". ".join(sentences[start:idx]) + "."
        chunks.append(chunk)
        start = idx
    # Don't forget the last chunk
    chunks.append(". ".join(sentences[start:]) + ".")

    return chunks

Semantic chunking produces variable-length chunks that correspond to topical segments of the document. A section discussing methodology might become one chunk, while the following section on experimental results becomes another, regardless of their respective lengths.
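The implementation above uses a percentile cutoff. The algorithm description also mentions a second rule: split where the drop is significantly larger than average. Below is a minimal sketch of that variant, operating on a list of already-computed similarities. The function name `split_points_by_drop`, the `num_std` parameter, and the example values are assumptions for illustration.

```python
import statistics

def split_points_by_drop(similarities, num_std=1.0):
    """Return sentence indices that should start a new chunk.

    similarities[i] is the similarity between sentence i and sentence i + 1.
    A split is placed wherever the similarity falls more than `num_std`
    standard deviations below the mean.
    """
    if len(similarities) < 2:
        return []
    mean = statistics.mean(similarities)
    std = statistics.pstdev(similarities)
    cutoff = mean - num_std * std
    return [i + 1 for i, sim in enumerate(similarities) if sim < cutoff]

# Made-up similarities with one obvious topic shift between sentences 2 and 3.
sims = [0.82, 0.79, 0.31, 0.85, 0.80, 0.77]
print(split_points_by_drop(sims))  # [3]
```

Unlike a fixed percentile, this rule produces no splits at all in a document whose similarities are uniformly high, which is often the behavior you want.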

The tradeoffs are real. Semantic chunking requires embedding every sentence in the document before you even begin the actual chunking, which makes it significantly more expensive than recursive or fixed-size approaches. For a corpus of 10,000 documents, this preprocessing cost adds up quickly.

When is it worth the cost? Heterogeneous documents benefit most. A document that mixes financial analysis, legal clauses, and technical specifications has dramatic topic shifts that semantic chunking handles gracefully. A research paper with a conventional structure benefits less, because the structure itself (sections, headers) already provides reliable split points that recursive chunking can exploit.

Document-Type-Specific Strategies

The best chunking strategy is the one that understands your documents. Different document types have different natural boundaries, and exploiting those boundaries produces better chunks than any generic approach.

Source Code

Code has explicit structure that text lacks. Functions, classes, and methods are natural chunk boundaries. Splitting a function across two chunks is almost always wrong, because each half will be difficult to interpret without the other.

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# Python-aware code splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200,
)

# This will split on class and function boundaries
code = open("my_module.py").read()
chunks = python_splitter.split_text(code)

LangChain provides language-aware splitters for Python, JavaScript, TypeScript, Go, Rust, Java, and several other languages. These use language-specific separators (class definitions, function definitions, decorators) instead of generic newlines.

Markdown

Markdown documents should be split on headers. Each section under a header forms a natural topical unit. The MarkdownHeaderTextSplitter in LangChain handles this directly:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

markdown_text = open("document.md").read()
chunks = splitter.split_text(markdown_text)

# Each chunk includes header metadata
for chunk in chunks[:3]:
    print(chunk.metadata)   # {'h1': 'Introduction', 'h2': 'Background'}
    print(chunk.page_content[:100])
    print()

Notice that the splitter automatically attaches header metadata to each chunk. This is enormously useful for retrieval, as we will discuss in the metadata section below.

Legal Documents

Legal text is organized around clauses, sections, and subsections, each with numbering schemes like "Section 4.2(b)(iii)". Splitting on these boundaries preserves the logical units that lawyers and compliance systems need. A regex-based splitter that recognizes common legal numbering patterns will outperform generic chunking on contracts and regulations.
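A regex-based splitter of this kind might look like the sketch below. The heading pattern is an assumption: it recognizes only lines that begin with "Section", "Article", or "Clause" followed by a number, and real contracts will need their own patterns. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Hypothetical heading pattern: matches the start of lines such as
# "Section 4", "Article 12.1" or "Clause 4.2(b)(iii)".
CLAUSE_START = re.compile(
    r"^(?=(?:Section|Article|Clause)\s+\d+(?:\.\d+)*(?:\([a-z0-9]+\))*)",
    re.MULTILINE,
)

def split_on_clauses(text):
    """Split text so that each chunk begins at a numbered clause heading."""
    parts = CLAUSE_START.split(text)
    return [part.strip() for part in parts if part.strip()]

contract = (
    "This Agreement is made between the parties.\n"
    "Section 1 Definitions\n"
    "Terms have the meanings given in Section 1 unless stated otherwise.\n"
    "Section 4.2(b)(iii) Termination\n"
    "Either party may terminate on written notice."
)

for clause in split_on_clauses(contract):
    print(repr(clause.splitlines()[0]))
```

Because the pattern is anchored to the start of a line, a cross-reference in running text ("the meanings given in Section 1") does not trigger a split.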

Conversations and Chat Logs

Conversations should be split on turn boundaries. Each turn (or small group of turns) forms a natural chunk. Splitting mid-turn destroys the question-answer pairing that gives conversational text its meaning.

def chunk_conversation(
    messages: list[dict],
    turns_per_chunk: int = 4
) -> list[str]:
    """Chunk a conversation by grouping turns together."""
    chunks = []
    for i in range(0, len(messages), turns_per_chunk):
        group = messages[i : i + turns_per_chunk]
        chunk_text = "\n".join(
            f"{m['role']}: {m['content']}"
            for m in group
        )
        chunks.append(chunk_text)
    return chunks

The general principle is clear: if your documents have structure, use it. Generic chunking is a fallback for when you do not know what your documents look like. Once you do know, specialize.

Tables and Structured Data

Tables are a special challenge. A table row makes little sense without its column headers, and a table split across two chunks will confuse both the embedding model and the language model downstream. The safest approach is to treat each table as an atomic unit. If a table is too large to fit in a single chunk, consider converting it to a series of natural-language statements ("Revenue in Q3 2024 was $4.2M") rather than splitting the table itself.
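The conversion to natural-language statements can be sketched as follows. The function `table_to_statements`, the sentence template, and the sample table are all assumptions for illustration. A real pipeline would likely need per-table templates.

```python
def table_to_statements(headers, rows, key_column):
    """Turn each table cell into a standalone sentence.

    `key_column` names the column that identifies a row (for example the
    quarter). Every other cell becomes "<column> in <key> was <value>."
    """
    statements = []
    for row in rows:
        record = dict(zip(headers, row))
        key = record[key_column]
        for column, value in record.items():
            if column == key_column:
                continue
            statements.append(f"{column} in {key} was {value}.")
    return statements

headers = ["Quarter", "Revenue", "Operating margin"]
rows = [
    ["Q2 2024", "$3.9M", "11%"],
    ["Q3 2024", "$4.2M", "13%"],
]

for statement in table_to_statements(headers, rows, key_column="Quarter"):
    print(statement)
```

Each statement carries its own row and column context, so it stays interpretable no matter which chunk it lands in.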

PDFs with Mixed Content

Real-world PDFs often combine narrative text, tables, images with captions, headers, footers, and page numbers. A robust chunking pipeline for PDFs needs to handle extraction before chunking. Tools like PyMuPDF, Unstructured, or Amazon Textract can classify page elements by type, allowing you to route different content types to different chunking strategies. Narrative paragraphs get recursive chunking; tables get preserved whole; image captions get attached to their nearest text chunk as metadata.
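The routing step might be sketched like this. The element format (dicts with "type" and "text" keys) is hypothetical; extraction tools each return their own element classes, so you would adapt the dispatch to whatever your extractor produces. As a simplification, captions are attached to the most recent preceding chunk.

```python
def route_elements(elements, chunk_narrative):
    """Send pre-extracted page elements to per-type handling.

    elements: hypothetical list of {"type": ..., "text": ...} dicts.
    chunk_narrative: any callable that splits a string into a list of chunks.
    """
    chunks = []
    for element in elements:
        kind, text = element["type"], element["text"]
        if kind == "narrative":
            for piece in chunk_narrative(text):
                chunks.append({"text": piece, "metadata": {"kind": "narrative"}})
        elif kind == "table":
            # Tables are kept whole as a single chunk.
            chunks.append({"text": text, "metadata": {"kind": "table"}})
        elif kind == "caption" and chunks:
            # Simplification: attach the caption to the preceding chunk.
            chunks[-1]["metadata"].setdefault("captions", []).append(text)
        # Headers, footers, page numbers and unknown types are dropped.
    return chunks

page = [
    {"type": "header", "text": "Annual Report 2024"},
    {"type": "narrative", "text": "Revenue grew in the third quarter."},
    {"type": "caption", "text": "Figure 2: Quarterly revenue"},
    {"type": "table", "text": "Quarter | Revenue\nQ3 2024 | $4.2M"},
    {"type": "footer", "text": "Page 12"},
]

# Stand-in narrative chunker for the demo: one chunk per element.
routed = route_elements(page, chunk_narrative=lambda text: [text])
for chunk in routed:
    print(chunk["metadata"], repr(chunk["text"][:30]))
```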

The Chunk Size and Retrieval Quality Curve

There is a fundamental tension in chunk size selection that no strategy fully resolves. Smaller chunks produce more precise embeddings, while larger chunks preserve more context. The relationship between chunk size and retrieval quality is not linear; it is a curve with a peak that depends on the nature of your queries.

Consider two extremes. At one end, you chunk at the sentence level. Each embedding is maximally precise: it represents exactly one idea. If a user's query matches that idea, the retrieval is excellent. But most queries require context that spans multiple sentences, and sentence-level chunks force the language model to reconstruct that context from scattered fragments. Liu et al. (2023) showed that language models use long contexts unevenly: accuracy is highest when the relevant information sits near the beginning or end of the context and drops when it sits in the middle. The more small fragments you retrieve to rebuild context, the more likely a key detail lands where the model uses it least.

At the other extreme, you chunk at the section or page level. Each embedding carries abundant context, but the signal is diluted. A page-level chunk about machine learning optimization will be a decent match for "What is gradient descent?" but also for "What is Adam?" and "What is learning rate scheduling?" and a dozen other queries. Precision suffers.

Precision falls as chunks grow, while context coverage rises. For most use cases, the best trade-off between the two falls somewhere in the 256–1024 token range.

The sweet spot depends on query type:

Factoid queries ("What year was the company founded?") benefit from smaller chunks (256-512 tokens) that isolate specific facts.
Conceptual queries ("Explain the company's growth strategy") benefit from larger chunks (512-1024 tokens) that capture reasoning and relationships.
Multi-hop queries ("How did the company's growth strategy change after the 2020 acquisition?") may benefit from a combination of chunk sizes, or from a reranking step that assembles context from multiple small chunks.

In practice, 512 tokens with 10-15% overlap is a reasonable starting point for most use cases. But you should always evaluate on your actual queries. Measure retrieval quality (recall@k, precision@k, or MRR) across different chunk sizes with a representative set of questions. The optimal size for your specific application may surprise you.

Metadata Enrichment: Context Beyond the Text

A chunk by itself is just a fragment of text. Without metadata, you lose the ability to filter, cite, and contextualize. Adding structured metadata to each chunk transforms it from an anonymous text snippet into a traceable piece of information.

Essential metadata fields include:

Source document: filename, URL, or document ID. Without this, you cannot tell the user where an answer came from.
Section title: the heading under which this chunk appeared. This enables filtering ("only search in the Methods section") and provides context to the language model.
Page number: critical for PDFs and long documents where users need to verify information.
Chunk index: the position of this chunk within the document, enabling retrieval of adjacent chunks for additional context.
Document type: report, email, contract, transcript. Enables type-based filtering.
Date: when the source document was created or last modified. Essential for time-sensitive domains.

Here is a complete chunking pipeline with metadata enrichment:

from dataclasses import dataclass, field
from langchain.text_splitter import RecursiveCharacterTextSplitter

@dataclass
class EnrichedChunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_with_metadata(
    text: str,
    source: str,
    doc_type: str = "unknown",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> list[EnrichedChunk]:
    """Chunk text and attach metadata to each chunk."""

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    raw_chunks = splitter.split_text(text)
    enriched = []

    for i, chunk_text in enumerate(raw_chunks):
        chunk = EnrichedChunk(
            text=chunk_text,
            metadata={
                "source": source,
                "doc_type": doc_type,
                "chunk_index": i,
                "total_chunks": len(raw_chunks),
                "char_count": len(chunk_text),
            }
        )
        enriched.append(chunk)

    return enriched

# Usage
chunks = chunk_with_metadata(
    text=open("annual_report.txt").read(),
    source="annual_report_2024.pdf",
    doc_type="financial_report",
)

print(chunks[0].metadata)
# {'source': 'annual_report_2024.pdf', 'doc_type': 'financial_report',
#  'chunk_index': 0, 'total_chunks': 47, 'char_count': 987}

Metadata also enables a powerful retrieval pattern: retrieve a chunk, then fetch its neighbors. If chunk 12 is relevant, chunks 11 and 13 probably provide useful context. This "context window expansion" at retrieval time partially compensates for using smaller chunk sizes.
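A minimal sketch of this neighbor lookup, assuming each chunk carries "source" and "chunk_index" metadata like the pipeline above. Plain dicts are used here instead of the EnrichedChunk dataclass to keep the example self-contained, and the function name `expand_with_neighbors` is an assumption.

```python
def expand_with_neighbors(hit, all_chunks, window=1):
    """Return the hit plus up to `window` chunks on each side.

    Only chunks from the same source document are considered, and the
    result is returned in document order.
    """
    source = hit["metadata"]["source"]
    index = hit["metadata"]["chunk_index"]
    wanted = range(index - window, index + window + 1)
    neighbors = [
        chunk for chunk in all_chunks
        if chunk["metadata"]["source"] == source
        and chunk["metadata"]["chunk_index"] in wanted
    ]
    return sorted(neighbors, key=lambda c: c["metadata"]["chunk_index"])

store = [
    {"text": f"chunk {i}", "metadata": {"source": "a.pdf", "chunk_index": i}}
    for i in range(5)
]
store.append({"text": "other", "metadata": {"source": "b.pdf", "chunk_index": 2}})

expanded = expand_with_neighbors(store[2], store, window=1)
print([c["text"] for c in expanded])  # ['chunk 1', 'chunk 2', 'chunk 3']
```

In a production system the linear scan would be replaced by a metadata filter in the vector store, but the logic is the same.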

Some teams go further, prepending a summary of the section title or parent document to each chunk before embedding. This biases the embedding to capture not just the chunk's content but its role within the larger document. The cost is slightly larger chunks and more preprocessing, but the improvement in retrieval relevance can be substantial.
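One simple form of this is to build the string that gets embedded from the document title, the section title, and the chunk text, while storing the original chunk text unchanged for display. The sketch below assumes that layout. The function name `text_for_embedding` and the " > " separator are illustrative choices, not an established convention.

```python
def text_for_embedding(chunk_text, doc_title, section_title=None):
    """Prefix a chunk with its document and section context before embedding."""
    header_parts = [doc_title]
    if section_title:
        header_parts.append(section_title)
    header = " > ".join(header_parts)
    return f"{header}\n\n{chunk_text}"

embedded_input = text_for_embedding(
    "Too large, and the algorithm will overshoot the minimum.",
    doc_title="Intro to Machine Learning",
    section_title="Choosing a Learning Rate",
)
print(embedded_input)
```

The chunk on its own never mentions what "too large" refers to; the prefix supplies that context to the embedding model.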

Common Pitfalls

Having worked through the strategies, it is worth cataloging the mistakes that appear most often in practice.

Ignoring token vs. character distinctions. LangChain's RecursiveCharacterTextSplitter measures in characters by default, not tokens. A 1000-character chunk of English text is roughly 200-250 tokens, depending on the text. If your embedding model has a 512-token limit and you set chunk_size=512 in a character-based splitter, your chunks will come out at roughly 100-130 tokens each, far smaller than intended. Always be explicit about your unit of measurement, and pass a token-counting length_function if you want to size chunks in tokens.

Using the same strategy for all document types. A pipeline that chunks legal contracts and Python source code with the same recursive text splitter is leaving quality on the table. The cost of implementing document-type routing is small compared to the retrieval improvement it yields.

Neglecting to evaluate. Many teams choose a chunk size based on intuition or blog posts and never measure whether a different size would work better. Even a simple experiment, comparing retrieval recall at k=5 across three chunk sizes, can reveal significant differences.

Over-relying on overlap. Some teams set overlap to 50% of chunk size, creating massive redundancy without proportionate benefit. This doubles your storage and embedding costs while only marginally improving boundary coverage. If you need that much overlap, your chunks are probably too small.

Stripping formatting before chunking. Whitespace, headers, and list markers carry structural information that chunking strategies depend on. Aggressively normalizing text before chunking removes the signals that recursive and document-type-specific splitters need to find good boundaries.
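The cost claim in the overlap pitfall above is easy to verify with arithmetic. The sketch below counts sliding-window chunks for a given overlap. The function name `chunk_count` and the 100,000-token corpus size are assumptions chosen for illustration.

```python
import math

def chunk_count(n_tokens, chunk_size, overlap):
    """Number of sliding-window chunks needed to cover n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # One chunk covers the first chunk_size tokens; each additional chunk
    # advances the window by `stride` tokens.
    return 1 + math.ceil((n_tokens - chunk_size) / stride)

baseline = chunk_count(100_000, 512, 0)
for overlap in (0, 51, 256):
    n = chunk_count(100_000, 512, overlap)
    print(f"overlap={overlap:>3}: {n} chunks ({n / baseline:.2f}x)")
# overlap=  0: 196 chunks (1.00x)
# overlap= 51: 217 chunks (1.11x)
# overlap=256: 390 chunks (1.99x)
```

A 10% overlap costs about 11% more chunks to embed and store, while a 50% overlap costs about twice as many.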

Evaluating Your Chunking Strategy

The only reliable way to choose a chunking strategy is to measure its impact on your specific use case. Here is a lightweight evaluation approach that requires no specialized tooling:

from sentence_transformers import SentenceTransformer
import numpy as np

def evaluate_chunking(
    chunks: list[str],
    queries: list[str],
    relevant_chunks: list[list[int]],  # ground truth: indices of relevant chunks per query
    model_name: str = "all-MiniLM-L6-v2",
    k: int = 5,
) -> dict:
    """Evaluate a chunking strategy using recall@k and MRR."""
    model = SentenceTransformer(model_name)

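    # Normalize so that the dot product below equals cosine similarity
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    query_embeddings = model.encode(queries, normalize_embeddings=True)

    # Compute cosine similarities (rows: queries, columns: chunks)
    similarities = np.dot(query_embeddings, chunk_embeddings.T)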

    recall_scores = []
    mrr_scores = []

    for i, query in enumerate(queries):
        # Get top-k chunk indices
        top_k = np.argsort(similarities[i])[::-1][:k]

        # Recall@k: fraction of relevant chunks in top-k
        relevant = set(relevant_chunks[i])
        retrieved = set(top_k.tolist())
        recall = len(relevant & retrieved) / max(len(relevant), 1)
        recall_scores.append(recall)

        # MRR: reciprocal rank of first relevant result
        for rank, idx in enumerate(top_k, 1):
            if idx in relevant:
                mrr_scores.append(1.0 / rank)
                break
        else:
            mrr_scores.append(0.0)

    return {
        "recall@k": np.mean(recall_scores),
        "mrr": np.mean(mrr_scores),
        "num_chunks": len(chunks),
    }

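# Compare strategies
# Assumes you have defined:
#   strategies:   dict mapping strategy name -> list of chunks
#   test_queries: list of query strings
#   ground_truth: dict mapping strategy name -> relevant chunk indices per query
#                 (indices differ per strategy because the chunks differ)
for strategy_name, chunks in strategies.items():
    results = evaluate_chunking(chunks, test_queries, ground_truth[strategy_name])
    print(f"{strategy_name}: Recall@5={results['recall@k']:.3f}, MRR={results['mrr']:.3f}, Chunks={results['num_chunks']}")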

The ground truth (which chunks are relevant to which queries) does require manual annotation, but even 30-50 annotated query-chunk pairs are enough to reveal meaningful differences between strategies. This is a few hours of work that can save weeks of debugging mysterious retrieval failures.
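The evaluation function above reports recall@k and MRR. Precision@k, the third metric mentioned earlier, can be computed from the same ranked list. This is a minimal sketch with a hypothetical helper name. It divides by k even when fewer than k results are returned, which is one of several conventions in use.

```python
def precision_at_k(ranked_indices, relevant, k=5):
    """Fraction of the top-k retrieved chunk indices that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked_indices)[:k]
    hits = sum(1 for idx in top_k if idx in relevant)
    return hits / k

# Made-up example: chunks 7 and 4 are relevant and retrieved, 8 is missed.
ranked = [3, 7, 1, 9, 4, 8]
relevant = {7, 4, 8}
print(precision_at_k(ranked, relevant, k=5))  # 0.4
```

Precision@k is sensitive to chunk size in a way recall is not: smaller chunks mean more relevant chunks per query, so compare it across strategies with care.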

Putting It Together: A Decision Framework

With so many options, how do you choose? Here is a practical framework:

Recursive baseline first, specialize by document type, attach metadata immediately, then measure before reaching for semantic chunking's compute cost.

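Start with recursive chunking. Use LangChain's RecursiveCharacterTextSplitter with a chunk size of 512-1000 characters (roughly 100-250 tokens; size it in tokens instead if you are targeting the 512-token starting point discussed earlier), 10-20% overlap, and separators appropriate for your document type. This is your baseline.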

Specialize for known document types. If your corpus is entirely Markdown, use a Markdown-aware splitter. If it is code, use a language-aware splitter. If it is a mix, classify documents first and route to specialized splitters.

Add metadata from the start. Retrofitting metadata onto an existing chunk store is painful. Build it into the pipeline on day one. At minimum, track source document, chunk position, and section title.

Evaluate empirically. Create a test set of 50-100 representative queries with known relevant documents. Measure retrieval quality across different chunk sizes and strategies. The results will tell you more than any theoretical argument.

Consider semantic chunking for high-value corpora. If your documents are heterogeneous, if topic shifts are unpredictable, and if retrieval quality is critical enough to justify the compute cost, semantic chunking can provide meaningful improvements over structure-based approaches.

Since Lewis et al. (2020) introduced retrieval-augmented generation, a consistent lesson has been that retrieval quality bounds answer quality. The language model can only work with what retrieval gives it. If your chunks are incoherent, your embeddings will be imprecise, your retrieval will be noisy, and your generated answers will suffer. Chunking is the foundation.

Most teams spend days evaluating embedding models and hours choosing chunk sizes. Invert that ratio. The embedding model matters, but it operates on what your chunking strategy gives it. Give it coherent, well-bounded, metadata-rich chunks and even a modest embedding model will produce good retrieval. Give it arbitrary slices of text and the best embedding model in the world will struggle.

Chunking is not glamorous work. It does not appear in paper titles or conference talks. But it is where RAG pipelines are won or lost.

. . .

References

Textbook grounding and extended commentary: Sources.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems, 33.
Kamradt, G. (2023). "5 Levels of Text Splitting." Full Stack Retrieval.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv preprint arXiv:2307.03172.
LangChain. (2024). "Text Splitters." LangChain Documentation.
