Breaking Text
Four algorithms. Four trade-offs. None of them know what a word is. That's why your LLM sees 'str', 'aw', 'berry'.
In the previous article, we explored why tokenization is hard. Words aren't what we think they are. Languages don't follow rules. And the whitespace heuristic fails catastrophically outside of English.
AI systems deploy a collection of statistical hacks, each with its own trade-offs, quirks, and failure modes.
These algorithms don't understand language.
They understand frequency distributions.
The Four Approaches
Think of tokenization algorithms like different philosophies for dividing a pizza:
BPE: Start with crumbs. Merge the pieces that appear together most often. Keep merging until you have reasonably-sized slices.
WordPiece: Same idea, but smarter about which pieces "belong" together. Don't just count frequency, but ask whether two pieces appear together more than you'd expect by chance.
SentencePiece: Stop assuming the pizza has natural cutting lines. Maybe it's a calzone. Treat the whole thing as dough and let statistics find the boundaries.
Unigram: Start with the whole pizza. Remove slices that matter least. Keep removing until you hit your target number of pieces.
Same problem. Four very different intuitions.
BPE: The Compression Algorithm That Could
Used by: GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Claude
Byte Pair Encoding wasn't invented for NLP. It was invented for data compression. In 1994.
The insight is elegant: if two bytes frequently appear next to each other, replace them with a single new byte. Repeat until you've compressed the data.
Sennrich et al. (2016) realized this same logic applies to text. Characters that frequently appear together should become a single token.
The Algorithm:
- Start with a vocabulary of individual characters
- Count every adjacent pair in your training data
- Merge the most frequent pair into a new token
- Repeat until you reach your target vocabulary size
Corpus: "low", "lower", "newest"

```
Step 0:  l o w </w>,  l o w e r </w>,  n e w e s t </w>
Count pairs: (l,o)=2, (o,w)=2, (w,e)=2, (e,s)=1, (s,t)=1...
Merge most frequent: (l,o) → "lo"
Result:  lo w </w>,  lo w e r </w>,  n e w e s t </w>

Count pairs again: (lo,w)=2, (w,e)=2, (e,s)=1, (s,t)=1...
Merge: (lo,w) → "low"
Result:  low </w>,  low e r </w>,  n e w e s t </w>

...continue for thousands of iterations
```
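The merge loop above can be sketched in a few lines of Python. This is a toy trainer on the same three-word corpus, not a production implementation:

```python
from collections import Counter

def get_pairs(words):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters plus an end-of-word marker.
words = {tuple("low") + ("</w>",): 1,
         tuple("lower") + ("</w>",): 1,
         tuple("newest") + ("</w>",): 1}

for step in range(3):
    best = get_pairs(words).most_common(1)[0][0]
    words = merge_pair(words, best)
    print(step, best)
```

Real implementations add tie-breaking rules and far more efficient pair bookkeeping, but the loop is the same: count, merge, repeat.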
After enough merges, common words like "the" become single tokens. Rare words like "Pneumonoultramicroscopicsilicovolcanoconiosis" stay fragmented, assembled from whatever pieces the algorithm has learned.
The genius: Common patterns compress. Rare patterns don't. The vocabulary naturally allocates capacity to what matters most.
The limitation: BPE is greedy. It merges the most frequent pair right now, with no lookahead. A globally suboptimal merge early on can cascade through the entire vocabulary.
WordPiece: Frequency Isn't Everything
Used by: BERT, DistilBERT, ELECTRA
Google's WordPiece looks similar to BPE, but asks a subtler question.
BPE asks: "Which pairs appear most often?"
WordPiece asks: "Which pairs appear together more than you'd expect by chance?"
The difference matters. Consider "th" and "e":
- They're both extremely common
- They often appear together (as "the")
- But they also appear separately all the time
BPE might merge them just because they're frequent. WordPiece checks whether their co-occurrence is surprising.
This is Pointwise Mutual Information (PMI) in ratio form (PMI proper takes the log of this quantity):
score(a, b) = frequency(ab) / (frequency(a) × frequency(b))
High score = "these tokens belong together."
Low score = "they're just both common."
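The scoring function is one line; the counts below are made up to illustrate the "the" case:

```python
def wordpiece_score(freq_ab, freq_a, freq_b):
    # Co-occurrence count normalized by the individual frequencies.
    return freq_ab / (freq_a * freq_b)

# Hypothetical counts: "th" and "e" are both very common, so even
# 1000 joint occurrences is roughly what chance would predict.
common_but_expected = wordpiece_score(1000, 5000, 9000)

# Two rarer pieces that almost always occur together score far higher.
rare_but_bound = wordpiece_score(800, 900, 1000)

print(common_but_expected, rare_but_bound)
```

The pair with the higher score gets merged first, even though its raw co-occurrence count is lower.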
The marker: WordPiece uses ## to indicate continuation tokens:
"tokenization" → ["token", "##ization"]
The ## says: "I'm not a word start. I attach to whatever came before."
This explicit marking helps the model understand token boundaries. When BERT sees ##ization, it knows this isn't a standalone concept but a suffix.
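The convention is simple enough that detokenization is a short loop. A minimal sketch (the helper name is mine, not from any library):

```python
def detokenize_wordpiece(tokens):
    # Rejoin WordPiece tokens: "##" pieces attach to the previous token.
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize_wordpiece(["token", "##ization", "is", "fun"]))
# → tokenization is fun
```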
The trade-off: WordPiece is more principled but computationally heavier. You're not just counting; you're computing statistical associations.
SentencePiece: What If Spaces Were Lies?
Used by: T5, ALBERT, XLNet, mBART, LLaMA
Here's the problem with BPE and WordPiece: they assume you know where words begin and end.
For English, that's mostly fine. Split on spaces, handle punctuation, proceed.
For Chinese? There are no spaces.
For Japanese? Some words have spaces, some don't, and the rules are complex.
For German? "Rindfleischetikettierungsueberwachungsaufgabenuebertragungsgesetz" is one word. (It means "beef labeling supervision duties delegation law." Obviously.)
SentencePiece's insight: stop pre-tokenizing.
Traditional pipeline:
Text → Split on spaces → Apply BPE/WordPiece to each word
SentencePiece pipeline:
Text → Apply subword algorithm directly (spaces are just another character)
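The difference shows up immediately if you try the traditional first step on non-English text. A minimal whitespace pre-tokenizer, for illustration only:

```python
import re

def pretokenize(text):
    # Traditional pipeline step: split on whitespace before subword encoding.
    return re.findall(r"\S+", text)

# English splits cleanly; Chinese comes back as one undivided chunk.
print(pretokenize("Hello world"))   # → ['Hello', 'world']
print(pretokenize("你好世界"))       # → ['你好世界']
```

SentencePiece skips this step entirely, so the subword algorithm gets a chance to find boundaries inside "你好世界" on its own.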
The marker: Instead of ## for continuations, SentencePiece uses ▁ (U+2581) for word starts:
"Hello world" → ["▁Hello", "▁world"] "你好世界" → ["▁你好", "世界"] (no assumption about word boundaries)
The ▁ says: "A new word starts here." Everything else is continuation.
This makes SentencePiece truly language-agnostic. It doesn't care whether your language uses spaces. It learns boundaries from data.
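One nice consequence of the ▁ convention: decoding is lossless string manipulation. A minimal sketch (the helper name is mine):

```python
def detokenize_sentencepiece(tokens):
    # ▁ (U+2581) marks a word start; concatenate everything and
    # turn the markers back into spaces.
    return "".join(tokens).replace("\u2581", " ").strip()

print(detokenize_sentencepiece(["\u2581Hello", "\u2581wor", "ld"]))
# → Hello world
```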
The flexibility: SentencePiece isn't an algorithm; it's a framework. It can use BPE internally, or it can use Unigram (which we'll cover next). The key innovation is the preprocessing approach, not the merging logic.
Unigram: Start Big, Prune Down
Used by: T5, ALBERT (via SentencePiece)
Every algorithm so far builds up from small pieces. Unigram goes the other direction.
Start with a huge vocabulary — every possible substring up to some length. Then ask: "If I remove this token, how much does it hurt my ability to represent the training data?"
Remove the tokens that hurt least. Keep going until you hit your target size.
The Math:
The Unigram model treats tokenization as probabilistic:
P(text) = P(token_1) × P(token_2) × ... × P(token_n)
For each candidate tokenization, compute the probability. Pick the one with highest probability.
Unlike BPE, which is deterministic (same text → same tokens, always), Unigram is inherently probabilistic. The same text could have multiple valid tokenizations, and the algorithm picks the most likely one.
Example:
"unaffable" could be tokenized as: ["un", "aff", "able"] → P = 0.003 ["una", "ff", "able"] → P = 0.001 ["un", "affable"] → P = 0.008 ← winner
The advantage: Unigram naturally handles ambiguity. When there are multiple reasonable tokenizations, it picks the most probable one based on training data statistics.
The training trick: During training, you can randomly sample from multiple tokenizations (not just the best one). This is called "subword regularization." It helps the model learn that different tokenizations represent the same content.
tiktoken: When Theory Meets Production
Used by: GPT-3.5, GPT-4, ChatGPT
tiktoken isn't a new algorithm. It's BPE, implemented in Rust, wrapped for Python.
Why does this matter?
Because tokenization happens on every single API call. Every prompt. Every response. Millions of times per second across OpenAI's infrastructure.
The numbers:
- 3-6x faster than HuggingFace tokenizers
- Exact compatibility with OpenAI's production systems
- Open source:
pip install tiktoken
Try it yourself:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(tokens)       # [9906, 11, 1917, 0]
print(len(tokens))  # 4 tokens
```
The lesson: Algorithmic elegance matters, but so does engineering. GPT-4 didn't need a better tokenization algorithm. It needed the same algorithm to run faster.
The Comparison
If you're using a pre-trained model: Use whatever tokenizer it was trained with. Seriously. Don't mix and match. A model trained with BPE will produce garbage if you feed it WordPiece tokens.
If you're training from scratch:
- English-only, BERT-style: WordPiece
- Multilingual: SentencePiece (with Unigram or BPE)
- Generative, GPT-style: BPE (via tiktoken or HuggingFace)
- Research, want flexibility: SentencePiece with Unigram
If you're building applications: Just count your tokens before sending.
```python
import tiktoken

def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# Now you can estimate costs, check context limits, etc.
```
What The Algorithms Actually Do
None of these algorithms understand language.
BPE finds frequent patterns. WordPiece finds statistically associated patterns. SentencePiece treats text as a byte stream. Unigram optimizes a probability distribution.
They're all just compression algorithms wearing linguistics as a costume.
And yet: they work. GPT-4 can write sonnets and debug code, processing text through an algorithm designed for data compression in 1994.