Tokenization: The Elegant Hack Powering Modern AI
Here's an uncomfortable truth about the AI systems everyone's talking about: they can't read.
When you type "ChatGPT is amazing!" into an LLM, the model doesn't see words.
It sees something like:
["Chat", "G", "PT", " is", " amazing", "!"]
This transformation from human-readable text to model-digestible tokens is tokenization. It's a hack. A remarkably effective hack, refined over decades, but a hack nonetheless. And it's arguably the most underappreciated component of modern AI systems.
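You can see the split for yourself with a tokenizer library. Here's a minimal sketch using OpenAI's tiktoken package; the exact pieces depend on which encoding you load, so your output may differ slightly from the example above.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's published encodings
enc = tiktoken.get_encoding("cl100k_base")

text = "ChatGPT is amazing!"
token_ids = enc.encode(text)

# Decode each id on its own to reveal the token boundaries
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # something like ['Chat', 'G', 'PT', ' is', ' amazing', '!']
print(f"{len(pieces)} tokens for {len(text.split())} whitespace words")
```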
Understanding tokenization isn't academic trivia.
It directly impacts:
- Cost: You pay per token, not per word
- Context limits: That 128K context window? It's tokens, not characters
- Model behavior: Why does GPT struggle with counting letters? Tokenization.
- Multilingual performance: Why do non-English languages use more tokens?
- Security: Prompt injection often exploits tokenization edge cases
These aren't separate considerations. They're symptoms of the same design choice, one made not from linguistic theory but from compression algorithms and corpus statistics.
If you want to understand how large language models behave, fail, and occasionally surprise you, you have to start here. At the seams.
Linguistic Foundations: What Is a "Word" Anyway?
Before we can understand why tokenization is hard, we need to confront an uncomfortable truth: linguists don't agree on what a "word" is.
Most people assume a word is a sequence of characters separated by spaces.
Like this.
Or like this.
But… maybe… not so much… like this?
Consider:
The whitespace heuristic fails catastrophically for languages such as Chinese, Japanese, and Thai, which don't mark word boundaries with spaces.
It also fails for agglutinative languages like Turkish and Finnish, and for compounding languages like German, all of which pack complex meanings into single orthographic words.
Linguistic Levels of Analysis
Linguists distinguish multiple levels of textual structure.
Graphemes are the smallest units of writing: letters, characters, the atomic symbols of a script. In English, that's a, b, c... In Chinese, each character is its own grapheme: 中, 文, 字. An English word might be five graphemes; a Chinese sentence might be three.
Morphemes are the smallest units of meaning.
Take unhappiness: that's three morphemes. un- signals negation (same pattern as unfair, undo). happy is the root. -ness converts adjective to noun.
Or consider cats: just cat plus -s for plural.
But then there's sang, where past tense isn't a suffix at all. It's encoded in the vowel change from sing. Linguists call this ablaut. Tokenizers call it a headache.
Lexemes are abstract dictionary entries. You don't look up runs, ran, and running separately; they're all instances of the lexeme RUN. The spelling changes, the tense changes, but the core meaning persists.
This abstraction is something humans do effortlessly and tokenizers struggle with. A word-level vocabulary treats run and running as unrelated entries, wasting capacity on redundant semantics. Subword tokenization recovers some of this: running becomes ["run", "ning"]. But the connection is a statistical accident, not linguistic insight.
Orthographic words are simply what appears between spaces. A convention that varies wildly across languages. Chinese uses no spaces at all. German compresses entire sentences into single compounds. English can't decide whether "ice cream" is one word or two.
Phonological words are prosodic units in speech that your mouth treats as a single chunk. When you say "going to" as intention (not motion), you don't produce two distinct words. You say "gonna"; one stress pattern, one breath unit, no internal pause. The orthography insists on two words; your vocal tract disagrees.
This mismatch matters: tokenizers typically operate on orthographic boundaries, but meaning often lives in phonological ones.
The Morpheme Insight
Here's the key insight for tokenization: morphemes carry meaning, not orthographic words.
When you read "unhappiness," you don't process it as an atomic unit.
You likely recognize:
- un- → negation (same as in unfair, undo)
- happy → the emotional state
- -ness → converts adjective to noun
This compositional understanding is precisely what subword tokenization tries to capture.
A good tokenizer should learn that un is a meaningful prefix that appears across many words, rather than treating unhappy and unfair as completely unrelated tokens.
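As a rough check, you can look at how a trained vocabulary actually splits related words. A sketch with tiktoken follows; the subword boundaries are whatever the encoding learned from its corpus, so they only approximate real morphemes.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Inspect whether negated words share a common "un" piece
for word in ["unhappiness", "unfair", "undo", "happiness"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(f"{word:>12} -> {pieces}")
```

Whether the splits line up with the prefix is an empirical question about the training corpus, not a guarantee.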
Why This Matters for LLMs
If a model has never seen the word "unhappiness" during training, but it has learned:
- un- as a negation prefix
- happy as an emotion word
- -ness as a nominalizer
…then it can potentially generalize to understand the composition of "unhappiness".
This is the promise of subword tokenization: morphologically-aware representations that enable generalization to unseen words.
But here's the catch: modern tokenizers learn these patterns statistically, not linguistically. They don't "know" that un- means negation. They just notice it appears frequently as a prefix. This works remarkably well, but it also leads to some bizarre edge cases we'll explore later.
Historical Evolution: From Whitespace to BPE
The history of tokenization mirrors the evolution of NLP itself.
Era 1: Rule-Based Tokenization (1950s-1990s)
Early NLP systems used hand-crafted rules:
- Split on whitespace
- Handle punctuation (., !, ?)
- Handle contractions (don't → do + n't OR don't → don + 't?)
- Handle possessives (John's → John + 's)
- Handle hyphenation (well-known → well-known OR well + known?)
The Penn Treebank Standard (1993) established conventions still used today:
- Contractions split: don't → do + n't
- Possessives split: John's → John + 's
- Punctuation separated: end. → end + .
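To make the flavor of that era concrete, here's a toy regex sketch of Penn-Treebank-style splitting. The rules below are simplified illustrations, not the official tokenizer.

```python
import re

def rule_based_tokenize(text):
    """Toy Penn-Treebank-style splitting: contractions, possessives,
    and sentence-final punctuation. Illustrative only."""
    text = re.sub(r"(\w+)n't\b", r"\1 n't", text)      # don't -> do n't
    text = re.sub(r"(\w+)'s\b", r"\1 's", text)        # John's -> John 's
    text = re.sub(r"([.!?])(\s|$)", r" \1\2", text)    # end. -> end .
    return text.split()

print(rule_based_tokenize("Don't touch John's book."))
# ['Do', "n't", 'touch', 'John', "'s", 'book', '.']
```

Even this toy version hints at the problem: every new construction needs another hand-written rule.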
But rule-based systems faced insurmountable challenges:
- Every language needed different rules
- Edge cases proliferated (what about Ph.D.? U.S.A.? 3.14?)
- No way to handle out-of-vocabulary (OOV) words
Era 2: Word-Level Vocabularies (1990s-2017)
Statistical NLP models used fixed vocabularies of the N most common words:
vocabulary = ["the", "a", "is", "happy", "cat", ...] # Top 50,000 words
Unknown words → <UNK> token
Problems:
- Vocabulary explosion: English has 170,000+ words; technical domains add more
- OOV problem: New words, names, typos → all become <UNK>
- Morphological blindness: run, runs, running, runner = 4 separate entries
- Storage/memory: Embedding matrices for 50K+ words are huge
The OOV problem was particularly brutal. Imagine a sentiment analysis model encountering:
- Brand names: "I love my new iPhone" → "I love my new <UNK>"
- Typos: "This is amazign" → "This is <UNK>"
- Slang: "That's lowkey fire" → "That's <UNK> <UNK>"
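A minimal sketch of the word-level scheme and its failure mode, using a deliberately tiny stand-in vocabulary:

```python
# Tiny stand-in vocabulary; real systems used the top ~50,000 corpus words
vocabulary = {"i", "love", "my", "new", "this", "is", "that's"}

def word_tokenize(text):
    """Map each whitespace word to itself if known, else to <UNK>."""
    return [w if w in vocabulary else "<UNK>" for w in text.lower().split()]

print(word_tokenize("I love my new iPhone"))   # ['i', 'love', 'my', 'new', '<UNK>']
print(word_tokenize("This is amazign"))        # ['this', 'is', '<UNK>']
print(word_tokenize("That's lowkey fire"))     # ["that's", '<UNK>', '<UNK>']
```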
Era 3: Character-Level Models (2015-2017)
One radical solution: forget words entirely, just use characters.
Vocabulary = {a, b, c, …, A, B, C, …, 0, 1, 2, …, punctuation}
~100-200 tokens total. No OOV problem ever!
Problems:
- Sequence length explosion: hello = 5 tokens instead of 1
- Long-range dependencies: Model must learn that c-a-t means the same across huge distances
- Computational cost: Attention is O(n²) in sequence length
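The trade-off is easy to demonstrate; a quick sketch comparing character tokens against a rough whitespace split:

```python
text = "character-level models never see an unknown token"

char_tokens = list(text)    # every single character is a token
word_tokens = text.split()  # crude word-level baseline for comparison

print(f"{len(char_tokens)} character tokens vs {len(word_tokens)} words")
# Sequences grow several times longer, and attention cost grows
# quadratically with that length.
```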
Character-level models worked for some tasks but struggled with semantic understanding. The model had to learn spelling, morphology, syntax, and semantics all from raw characters.
Too much to ask!
Era 4: Subword Tokenization (2016-Present)
The breakthrough insight: find a middle ground between words and characters.
Instead of a fixed vocabulary of words OR characters, learn a vocabulary of frequent substrings that balance:
- Coverage (no OOV)
- Efficiency (reasonable sequence lengths)
- Semantic coherence (meaningful units)
This is exactly what Byte Pair Encoding (BPE) achieves. Originally a data compression algorithm from 1994, it was adapted for NLP by Sennrich et al. in 2016 and quickly became the foundation for modern tokenization.
Algorithm:
- Start with character vocabulary: {a, b, c, …}
- Count all adjacent character pairs in training data
- Merge most frequent pair into new token
- Repeat until desired vocabulary size
Example evolution:
Initial: "l o w </w>", "l o w e r </w>", "n e w e s t </w>" Most frequent pair: "e s" โ merge to "es" Result: "l o w </w>", "l o w e r </w>", "n e w es t </w>" Most frequent pair: "es t" โ merge to "est" Result: "l o w </w>", "l o w e r </w>", "n e w est </w>" Most frequent pair: "l o" โ merge to "lo" Result: "lo w </w>", "lo w e r </w>", "n e w est </w>" ... continue until vocabulary size reached
Each merge consolidates the most frequent pattern. After thousands of iterations, the vocabulary stabilizes: common sequences have earned their own tokens, while rare combinations remain fragmented, assembled on demand from smaller pieces.
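For concreteness, here is a compact toy implementation of that training loop over a four-word corpus. It follows the spirit of Sennrich et al.'s reference code, but it's a sketch, not a production tokenizer.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: each word is pre-split into characters plus an end-of-word
# marker, mapped to its frequency in the (imaginary) training data.
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

for step in range(10):  # ten merges; real vocabularies use tens of thousands
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
```

Each printed merge adds one entry to the vocabulary; production tokenizers run tens of thousands of merges over far larger corpora.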
| Word | Tokenization |
| --- | --- |
| the | ["the"] |
| running | ["run", "ning"] |
| transformers | ["transform", "ers"] |
| Pneumonoultramicroscopicsilicovolcanoconiosis | ["P", "ne", "um", "ono", "ult", "ram", "ic", "ros", "cop", "ic", "s", "il", "ic", "ov", "ol", "can", "oc", "on", "i", "osis"] |
The genius: common words become single tokens, rare words decompose into subwords.
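You can verify that balance directly; a quick sketch with tiktoken (token counts and splits vary by encoding, so treat the table above as illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Common words should come back as one token; rare words as many pieces
for word in ["the", "running", "transformers",
             "Pneumonoultramicroscopicsilicovolcanoconiosis"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(f"{len(pieces):>2} tokens: {pieces}")
```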
Timeline of Major Tokenizers
| Year | Tokenizer | Used by |
| --- | --- | --- |
| 2016 | BPE | Original GPT, early transformers |
| 2018 | WordPiece | BERT, DistilBERT |
| 2018 | SentencePiece | T5, ALBERT, XLNet, mBART |
| 2019 | Unigram LM | SentencePiece option |
| 2019 | Byte-level BPE | GPT-2, GPT-3, GPT-4 |
| 2023 | tiktoken | OpenAI's optimized implementation |
The Uncomfortable Truth
Tokenization is a hack.
A remarkably effective hack, but a hack nonetheless.
We wanted models that understand language. We got models that understand statistically-frequent byte sequences. BPE doesn't know that un- means negation. It just noticed the pattern appears often enough to merit its own token. The fact that this correlates with morphological structure is convenient, not intentional.
And yet: it works. LLMs can write poetry, explain quantum mechanics, and debug your code, all while processing text through a compression algorithm from 1994. The gap between "frequency-based substring merging" and "understanding" turns out to be narrower than anyone expected.
Perhaps understanding was never what we thought it was.
So the next time an LLM confidently miscounts the letters in strawberry, remember: it never saw the word. It saw ["str", "aw", "berry"] and did its best.
We all are.