Tokenization: The Elegant Hack Powering Modern AI
Here's an uncomfortable truth about the AI systems everyone's talking about: they can't read.
When you type "ChatGPT is amazing!" into an LLM, the model doesn't see words.
It sees something like:
["Chat", "G", "PT", " is", " amazing", "!"]
This transformation from human-readable text to model-digestible tokens is tokenization. It's a hack. A remarkably effective hack, refined over decades, but a hack nonetheless. And it's arguably the most underappreciated component of modern AI systems.
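You can see the split for yourself with a tokenizer library. Here's a minimal sketch using OpenAI's tiktoken package; the exact pieces depend on which encoding you load, so your output may differ slightly from the example above.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's published encodings
enc = tiktoken.get_encoding("cl100k_base")

text = "ChatGPT is amazing!"
token_ids = enc.encode(text)

# Decode each id on its own to reveal the token boundaries
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # something like ['Chat', 'G', 'PT', ' is', ' amazing', '!']
print(f"{len(pieces)} tokens for {len(text.split())} whitespace words")
```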
Understanding tokenization isn't academic trivia.
It directly impacts:
- Cost: You pay per token, not per word
- Context limits: That 128K context window? It's tokens, not characters
- Model behavior: Why does GPT struggle with counting letters? Tokenization.
- Multilingual performance: Why do non-English languages use more tokens?
- Security: Prompt injection often exploits tokenization edge cases
These aren't separate considerations. They're symptoms of the same design choice, one made not from linguistic theory but from compression algorithms and corpus statistics.
If you want to understand how large language models behave, fail, and occasionally surprise you, you have to start here. At the seams.
Linguistic Foundations: What Is a "Word" Anyway?
Before we can understand why tokenization is hard, we need to confront an uncomfortable truth: linguists don't agree on what a "word" is.
Most people assume a word is a sequence of characters separated by spaces.
Like this.
Or like this.
But… maybe… not so much… like this?
Consider:
The whitespace heuristic fails catastrophically for languages such as Chinese, Japanese, and Thai, which don't mark word boundaries with spaces.
It also fails for agglutinative languages like Turkish and Finnish, and for compounding languages like German, all of which pack complex meanings into single orthographic words.
Linguistic Levels of Analysis
Linguists distinguish multiple levels of textual structure.
Graphemes are the smallest units of writing: letters, characters, the atomic symbols of a script. In English, that's a, b, c... In Chinese, each character is its own grapheme: 中, 文, 字. An English word might be five graphemes; a Chinese sentence might be three.
Morphemes are the smallest units of meaning.
Take unhappiness: that's three morphemes. un- signals negation (same pattern as unfair, undo). happy is the root. -ness converts adjective to noun.
Or consider cats: just cat plus -s for plural.
But then there's sang, where past tense isn't a suffix at all. It's encoded in the vowel change from sing. Linguists call this ablaut. Tokenizers call it a headache.
Lexemes are abstract dictionary entries. You don't look up runs, ran, and running separately; they're all instances of the lexeme RUN. The spelling changes, the tense changes, but the core meaning persists.
This abstraction is something humans do effortlessly and tokenizers struggle with. A word-level vocabulary treats run and running as unrelated entries, wasting capacity on redundant semantics. Subword tokenization recovers some of this: running becomes ["run", "ning"]. But the connection is a statistical accident, not linguistic insight.
Orthographic words are simply what appears between spaces. A convention that varies wildly across languages. Chinese uses no spaces at all. German compresses entire sentences into single compounds. English can't decide whether "ice cream" is one word or two.
Phonological words are prosodic units in speech that your mouth treats as a single chunk. When you say "going to" as intention (not motion), you don't produce two distinct words. You say "gonna"; one stress pattern, one breath unit, no internal pause. The orthography insists on two words; your vocal tract disagrees.
This mismatch matters: tokenizers typically operate on orthographic boundaries, but meaning often lives in phonological ones.
The Morpheme Insight
Here's the key insight for tokenization: morphemes carry meaning, not orthographic words.
When you read "unhappiness," you don't process it as an atomic unit.
You likely recognize:
- un- → negation (same as in unfair, undo)
- happy → the emotional state
- -ness → converts adjective to noun
This compositional understanding is precisely what subword tokenization tries to capture.
A good tokenizer should learn that un is a meaningful prefix that appears across many words, rather than treating unhappy and unfair as completely unrelated tokens.
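As a rough check, you can look at how a trained vocabulary actually splits related words. A sketch with tiktoken follows; the subword boundaries are whatever the encoding learned from its corpus, so they only approximate real morphemes.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Inspect whether negated words share a common "un" piece
for word in ["unhappiness", "unfair", "undo", "happiness"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(f"{word:>12} -> {pieces}")
```

Whether the splits line up with the prefix is an empirical question about the training corpus, not a guarantee.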
Why This Matters for LLMs
If a model has never seen the word "unhappiness" during training, but it has learned:
- un- as a negation prefix
- happy as an emotion word
- -ness as a nominalizer
…then it can potentially generalize to understand the composition of "unhappiness".
This is the promise of subword tokenization: morphologically-aware representations that enable generalization to unseen words.
But here's the catch: modern tokenizers learn these patterns statistically, not linguistically. They don't "know" that un- means negation. They just notice it appears frequently as a prefix. This works remarkably well, but it also leads to some bizarre edge cases we'll explore later.
Historical Evolution: From Whitespace to BPE
The history of tokenization mirrors the evolution of NLP itself.
Era 1: Rule-Based Tokenization (1950s-1990s)
Early NLP systems used hand-crafted rules:
- Split on whitespace
- Handle punctuation (., !, ?)
- Handle contractions (don't → do + n't OR don't → don + 't?)
- Handle possessives (John's → John + 's)
- Handle hyphenation (well-known → well-known OR well + known?)
The Penn Treebank Standard (1993) established conventions still used today:
- Contractions split: don't → do + n't
- Possessives split: John's → John + 's
- Punctuation separated: end. → end + .
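To make the flavor of that era concrete, here's a toy regex sketch of Penn-Treebank-style splitting. The rules below are simplified illustrations, not the official tokenizer.

```python
import re

def rule_based_tokenize(text):
    """Toy Penn-Treebank-style splitting: contractions, possessives,
    and sentence-final punctuation. Illustrative only."""
    text = re.sub(r"(\w+)n't\b", r"\1 n't", text)      # don't -> do n't
    text = re.sub(r"(\w+)'s\b", r"\1 's", text)        # John's -> John 's
    text = re.sub(r"([.!?])(\s|$)", r" \1\2", text)    # end. -> end .
    return text.split()

print(rule_based_tokenize("Don't touch John's book."))
# ['Do', "n't", 'touch', 'John', "'s", 'book', '.']
```

Even this toy version hints at the problem: every new construction needs another hand-written rule.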
But rule-based systems faced insurmountable challenges:
- Every language needed different rules
- Edge cases proliferated (what about Ph.D.? U.S.A.? 3.14?)
- No way to handle out-of-vocabulary (OOV) words
Era 2: Word-Level Vocabularies (1990s-2017)
Statistical NLP models used fixed vocabularies of the N most common words:
vocabulary = ["the", "a", "is", "happy", "cat", ...] # Top 50,000 words
Unknown words → <UNK> token
Problems:
- Vocabulary explosion: English has 170,000+ words; technical domains add more
- OOV problem: New words, names, typos → all become <UNK>
- Morphological blindness: run, runs, running, runner = 4 separate entries
- Storage/memory: Embedding matrices for 50K+ words are huge
The OOV problem was particularly brutal. Imagine a sentiment analysis model encountering:
- Brand names: "I love my new iPhone" → "I love my new <UNK>"
- Typos: "This is amazign" → "This is <UNK>"
- Slang: "That's lowkey fire" → "That's <UNK> <UNK>"
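A minimal sketch of the word-level scheme and its failure mode, using a deliberately tiny stand-in vocabulary:

```python
# Tiny stand-in vocabulary; real systems used the top ~50,000 corpus words
vocabulary = {"i", "love", "my", "new", "this", "is", "that's"}

def word_tokenize(text):
    """Map each whitespace word to itself if known, else to <UNK>."""
    return [w if w in vocabulary else "<UNK>" for w in text.lower().split()]

print(word_tokenize("I love my new iPhone"))   # ['i', 'love', 'my', 'new', '<UNK>']
print(word_tokenize("This is amazign"))        # ['this', 'is', '<UNK>']
print(word_tokenize("That's lowkey fire"))     # ["that's", '<UNK>', '<UNK>']
```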
Era 3: Character-Level Models (2015-2017)
One radical solution: forget words entirely, just use characters.
Vocabulary = {a, b, c, …, A, B, C, …, 0, 1, 2, …, punctuation}
~100-200 tokens total. No OOV problem ever!
Problems:
- Sequence length explosion: hello = 5 tokens instead of 1
- Long-range dependencies: Model must learn that c-a-t means the same across huge distances
- Computational cost: Attention is O(n²) in sequence length
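The trade-off is easy to demonstrate; a quick sketch comparing character tokens against a rough whitespace split:

```python
text = "character-level models never see an unknown token"

char_tokens = list(text)    # every single character is a token
word_tokens = text.split()  # crude word-level baseline for comparison

print(f"{len(char_tokens)} character tokens vs {len(word_tokens)} words")
# Sequences grow several times longer, and attention cost grows
# quadratically with that length.
```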
Character-level models worked for some tasks but struggled with semantic understanding. The model had to learn spelling, morphology, syntax, and semantics all from raw characters.
Too much to ask!
Era 4: Subword Tokenization (2016-Present)
The breakthrough insight: find a middle ground between words and characters.
Instead of a fixed vocabulary of words OR characters, learn a vocabulary of frequent substrings that balance:
- Coverage (no OOV)
- Efficiency (reasonable sequence lengths)
- Semantic coherence (meaningful units)
This is exactly what Byte Pair Encoding (BPE) achieves. Originally a data compression algorithm from 1994, it was adapted for NLP by Sennrich et al. in 2016 and quickly became the foundation for modern tokenization.
Algorithm:
- Start with character vocabulary: {a, b, c, …}
- Count all adjacent character pairs in training data
- Merge most frequent pair into new token
- Repeat until desired vocabulary size
Example evolution:
Initial: "l o w </w>", "l o w e r </w>", "n e w e s t </w>" Most frequent pair: "e s" โ merge to "es" Result: "l o w </w>", "l o w e r </w>", "n e w es t </w>" Most frequent pair: "es t" โ merge to "est" Result: "l o w </w>", "l o w e r </w>", "n e w est </w>" Most frequent pair: "l o" โ merge to "lo" Result: "lo w </w>", "lo w e r </w>", "n e w est </w>" ... continue until vocabulary size reached
Each merge consolidates the most frequent pattern. After thousands of iterations, the vocabulary stabilizes: common sequences have earned their own tokens, while rare combinations remain fragmented, assembled on demand from smaller pieces.
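For concreteness, here is a compact toy implementation of that training loop over a four-word corpus. It follows the spirit of Sennrich et al.'s reference code, but it's a sketch, not a production tokenizer.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: each word is pre-split into characters plus an end-of-word
# marker, mapped to its frequency in the (imaginary) training data.
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

for step in range(10):  # ten merges; real vocabularies use tens of thousands
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
```

Each printed merge adds one entry to the vocabulary; production tokenizers run tens of thousands of merges over far larger corpora.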
| Word | Tokenization |
| --- | --- |
| the | ["the"] |
| running | ["run", "ning"] |
| transformers | ["transform", "ers"] |
| Pneumonoultramicroscopicsilicovolcanoconiosis | ["P", "ne", "um", "ono", "ult", "ram", "ic", "ros", "cop", "ic", "s", "il", "ic", "ov", "ol", "can", "oc", "on", "i", "osis"] |
The genius: common words become single tokens, rare words decompose into subwords.
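You can verify that balance directly; a quick sketch with tiktoken (token counts and splits vary by encoding, so treat the table above as illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Common words should come back as one token; rare words as many pieces
for word in ["the", "running", "transformers",
             "Pneumonoultramicroscopicsilicovolcanoconiosis"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(f"{len(pieces):>2} tokens: {pieces}")
```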
Timeline of Major Tokenizers
| Year | Tokenizer | Used by |
| --- | --- | --- |
| 2016 | BPE | Original GPT, early transformers |
| 2018 | WordPiece | BERT, DistilBERT |
| 2018 | SentencePiece | T5, ALBERT, XLNet, mBART |
| 2019 | Unigram LM | SentencePiece option |
| 2019 | Byte-level BPE | GPT-2, GPT-3, GPT-4 |
| 2023 | tiktoken | OpenAI's optimized implementation |
The Uncomfortable Truth
Tokenization is a hack.
A remarkably effective hack, but a hack nonetheless.
We wanted models that understand language. We got models that understand statistically-frequent byte sequences. BPE doesn't know that un- means negation. It just noticed the pattern appears often enough to merit its own token. The fact that this correlates with morphological structure is convenient, not intentional.
And yet: it works. LLMs can write poetry, explain quantum mechanics, and debug your code, all while processing text through a compression algorithm from 1994. The gap between "frequency-based substring merging" and "understanding" turns out to be narrower than anyone expected.
Perhaps understanding was never what we thought it was.
So the next time an LLM confidently miscounts the letters in strawberry, remember: it never saw the word. It saw ["str", "aw", "berry"] and did its best.
We all are.