
Tokenization: The Elegant Hack Powering Modern AI

Here's an uncomfortable truth about the AI systems everyone's talking about: they can't read.

When you type "ChatGPT is amazing!" into an LLM, the model doesn't see words.

It sees something like:

["Chat", "G", "PT", " is", " amazing", "!"]

This transformation from human-readable text to model-digestible tokens is tokenization. It's a hack. A remarkably effective hack, refined over decades, but a hack nonetheless. And it's arguably the most underappreciated component of modern AI systems.

Understanding tokenization isn't academic trivia.

It directly impacts how models read your text, which words they can generalize to, and why they fail in ways that seem baffling from the outside.

These aren't separate considerations. They're symptoms of the same design choice, one made not from linguistic theory but from compression algorithms and corpus statistics.

If you want to understand how large language models behave, fail, and occasionally surprise you, you have to start here. At the seams.

Linguistic Foundations: What Is a "Word" Anyway?

Before we can understand why tokenization is hard, we need to confront an uncomfortable truth: linguists don't agree on what a "word" is.

Most people assume a word is a sequence of characters separated by spaces.

Like this.

Or like this.

But… maybe… not so much… like this?

Consider:

Linguistic word examples
Ask a linguist, lose an afternoon
NLP challenges with words
Why NLP Engineers Drink

The whitespace heuristic fails catastrophically for languages such as Chinese, Japanese, and Thai, which don't mark word boundaries with spaces at all.

It also fails for agglutinative languages like Turkish and Finnish, and for compounding languages like German, all of which pack complex meanings into single orthographic words.
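A three-line demonstration makes the failure concrete. The example strings below are mine, chosen only to illustrate the point:

# Naive whitespace splitting: tolerable for English, useless elsewhere.
print("the cat sat on the mat".split())           # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print("分词难".split())                             # ['分词难'] - the whole sentence comes back as one "word"
print("Lebensversicherungsgesellschaft".split())  # ['Lebensversicherungsgesellschaft'] - one giant compound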

Linguistic Levels of Analysis

Linguists distinguish multiple levels of textual structure.

Graphemes are the smallest units of writing: letters, characters, the atomic symbols of a script. In English, that's a, b, c... In Chinese, each character is its own grapheme: 分, 词, 难. An English word might be five graphemes; a Chinese sentence might be three.

Morphemes are the smallest units of meaning.

Take unhappiness: that's three morphemes. un- signals negation (the same pattern as in unfair and undo). happy is the root. -ness converts the adjective into a noun.

unhappiness = un- (negation) + happy (root) + -ness (nominalization)

Or consider cats: just cat plus -s for plural.

cats = cat (root) + -s (plural)

But then there's sang, where past tense isn't a suffix at all. It's encoded in the vowel change from sing. Linguists call this ablaut. Tokenizers call it a headache.

sang = sing (root) + past tense (expressed via ablaut, not suffix!)
Morphemes illustration
At least these have spaces between them.

Lexemes are abstract dictionary entries. You don't look up runs, ran, and running separately; they're all instances of the lexeme RUN. The spelling changes, the tense changes, but the core meaning persists.

This abstraction is something humans do effortlessly and tokenizers struggle with. A word-level vocabulary treats run and running as unrelated entries, wasting capacity on redundant semantics. Subword tokenization recovers some of this: running becomes ["run", "ning"]. But the connection is a statistical accident, not a linguistic insight.

Orthographic words are simply what appears between spaces. A convention that varies wildly across languages. Chinese uses no spaces at all. German compresses entire sentences into single compounds. English can't decide whether "ice cream" is one word or two.

Phonological words are prosodic units in speech that your mouth treats as a single chunk. When you say "going to" as intention (not motion), you don't produce two distinct words. You say "gonna"; one stress pattern, one breath unit, no internal pause. The orthography insists on two words; your vocal tract disagrees.

This mismatch matters: tokenizers typically operate on orthographic boundaries, but meaning often lives in phonological ones.

The Morpheme Insight

Here's the key insight for tokenization: morphemes carry meaning, not orthographic words.

When you read "unhappiness," you don't process it as an atomic unit.

You likely recognize:

- un- as negation
- happy as the familiar root
- -ness as the suffix that turns an adjective into a noun

This compositional understanding is precisely what subword tokenization tries to capture.

A good tokenizer should learn that un is a meaningful prefix that appears across many words, rather than treating unhappy and unfair as completely unrelated tokens.
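You can watch a real tokenizer make (and miss) these connections. The sketch below assumes OpenAI's tiktoken package and its cl100k_base vocabulary; any subword tokenizer would do:

import tiktoken

# Inspect how a byte-level BPE vocabulary splits a family of "un-" words.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["unhappiness", "unhappy", "unfair", "undo"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word:>12} -> {pieces}")

# Whether "un" surfaces as its own piece depends on corpus statistics,
# not on any notion of a negation prefix.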

Subword tokenization concept
The pieces are all there. Theoretically.

Why This Matters for LLMs

If a model has never seen the word "unhappiness" during training, but it has learned:

- what un- does in unfair and undo
- what happy means
- what -ness does in words like kindness and sadness

…then it can potentially generalize to understand the composition of "unhappiness".

This is the promise of subword tokenization: morphologically-aware representations that enable generalization to unseen words.

But here's the catch: modern tokenizers learn these patterns statistically, not linguistically. They don't "know" that un- means negation. They just notice it appears frequently as a prefix. This works remarkably well, but it also leads to some bizarre edge cases we'll explore later.

Historical Evolution: From Whitespace to BPE

The history of tokenization mirrors the evolution of NLP itself.

Era 1: Rule-Based Tokenization (1950s-1990s)

Early NLP systems used hand-crafted rules: split on whitespace, peel punctuation off the edges of words, and special-case contractions, abbreviations, and numbers.

The Penn Treebank Standard (1993) established conventions still used today: punctuation becomes its own token, contractions are split (don't → do n't), and possessives are peeled off (John's → John 's).

But rule-based systems faced insurmountable challenges: every new language, domain, and genre demanded new rules, and the edge cases (URLs, emoticons, biomedical jargon) multiplied faster than anyone could write them.
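As a flavor of what those hand-written rules looked like, here is a toy, Penn Treebank-flavored tokenizer. It is a sketch of the general approach, not the actual PTB tokenizer script, and the regexes are illustrative:

import re

def rule_based_tokenize(text: str) -> list[str]:
    """Toy rule-based tokenizer: a handful of hand-written regexes."""
    # Split off "n't" and common clitics before anything else.
    text = re.sub(r"n't\b", " n't", text)
    text = re.sub(r"'(s|re|ve|ll|m|d)\b", r" '\1", text)
    # Put spaces around punctuation so whitespace splitting can finish the job.
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)
    return text.split()

print(rule_based_tokenize("Don't panic, it's John's book."))
# ['Do', "n't", 'panic', ',', 'it', "'s", 'John', "'s", 'book', '.']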

Era 2: Word-Level Vocabularies (1990s-2017)

Statistical NLP models used fixed vocabularies of the N most common words:

vocabulary = ["the", "a", "is", "happy", "cat", ...]  # Top 50,000 words

Unknown words → <UNK> token
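In code, the entire scheme is a dictionary lookup with a fallback. The toy vocabulary below is a stand-in for the real 50,000-word list:

# Word-level encoding: anything outside the fixed vocabulary collapses to <UNK>.
vocabulary = {"the": 0, "cat": 1, "is": 2, "happy": 3, "<UNK>": 4}

def encode(text: str) -> list[int]:
    return [vocabulary.get(word, vocabulary["<UNK>"]) for word in text.split()]

print(encode("the cat is happy"))    # [0, 1, 2, 3]
print(encode("the cat is unhappy"))  # [0, 1, 2, 4]  <- all meaning in "unhappy" is gone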

Problems:

- any word outside the list collapses to <UNK>, taking its meaning with it
- the vocabulary (and its embedding matrix) has to be enormous to achieve decent coverage
- run, runs, and running occupy three unrelated slots

The OOV problem was particularly brutal. Imagine a sentiment analysis model encountering a review built around a word coined last week, a brand name, or a typo: the one token carrying all the sentiment gets flattened into <UNK>, and the signal is gone.

Era 3: Character-Level Models (2015-2017)

One radical solution: forget words entirely, just use characters.

Vocabulary = {a, b, c, …, A, B, C, …, 0, 1, 2, …, punctuation}

~100–200 tokens total. No OOV problem ever!
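Building one takes a few lines, which is the whole appeal; the cost shows up in sequence length. A toy sketch (real character models also reserve special tokens):

# Character-level tokenization: tiny vocabulary, long sequences.
text = "Tokenization is an elegant hack."

vocab = sorted(set(text))                      # every distinct character is a token
char_to_id = {ch: i for i, ch in enumerate(vocab)}

ids = [char_to_id[ch] for ch in text]
print(len(vocab))  # a few dozen symbols at most
print(len(ids))    # 32 tokens for a 32-character sentence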

Problems:

- sequences get several times longer, so compute and memory costs balloon
- dependencies stretch across many more positions
- every unit of meaning has to be reassembled from individual letters

Character-level models worked for some tasks but struggled with semantic understanding. The model had to learn spelling, morphology, syntax, and semantics all from raw characters.

Too much to ask!

Era 4: Subword Tokenization (2016-Present)

The breakthrough insight: find a middle ground between words and characters.

Instead of a fixed vocabulary of words OR characters, learn a vocabulary of frequent substrings that balance:

- coverage: any string can still be encoded, so nothing falls to <UNK>
- compactness: common words stay whole, so sequences stay short
- generalization: rare words break into reusable, meaning-bearing pieces

This is exactly what Byte Pair Encoding (BPE) achieves. Originally a data compression algorithm from 1994, it was adapted for NLP by Sennrich et al. in 2016 and quickly became the foundation for modern tokenization.

Algorithm:

  1. Start with character vocabulary: {a, b, c, …}
  2. Count all adjacent character pairs in training data
  3. Merge most frequent pair into new token
  4. Repeat until desired vocabulary size

Example evolution:

Initial: "l o w </w>", "l o w e r </w>", "n e w e s t </w>"
Most frequent pair: "e s" โ†’ merge to "es"
Result: "l o w </w>", "l o w e r </w>", "n e w es t </w>"

Most frequent pair: "es t" โ†’ merge to "est"
Result: "l o w </w>", "l o w e r </w>", "n e w est </w>"

Most frequent pair: "l o" โ†’ merge to "lo"
Result: "lo w </w>", "lo w e r </w>", "n e w est </w>"

... continue until vocabulary size reached

Each merge consolidates the most frequent pattern. After thousands of iterations, the vocabulary stabilizes: common sequences have earned their own tokens, while rare combinations remain fragmented, assembled on demand from smaller pieces.
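Here is a compact sketch of that training loop. It follows the word-level BPE formulation from Sennrich et al. (2016) on a toy corpus; the corpus and names are mine, and production implementations add end-of-word markers, byte-level fallbacks, and heavy optimization:

from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the weighted corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(train_bpe(corpus, num_merges=5))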

Word                                            Tokenization
-----------------------------------------------------------------------------
the                                             ["the"]
running                                         ["run", "ning"]
transformers                                    ["transform", "ers"]
Pneumonoultramicroscopicsilicovolcanoconiosis   ["P", "ne", "um", "ono", "ult", "ram", "ic", "ros", "cop", "ic", "s", "il", "ic", "ov", "ol", "can", "oc", "on", "i", "osis"]
Zipf's revenge

The genius: common words become single tokens, rare words decompose into subwords.

Long word tokenization
Pneumonoultramicroscopicsilicovolcanoconiosis enters the chat.

Timeline of Major Tokenizers

2016: BPE (original GPT, early transformers)

2018: WordPiece (BERT, DistilBERT)

2018: SentencePiece (T5, ALBERT, XLNet, mBART)

2018: Unigram LM (a SentencePiece option)

2019: Byte-level BPE (GPT-2, GPT-3, GPT-4)

2023: tiktoken (OpenAI's optimized implementation)


The Uncomfortable Truth

Tokenization is a hack.

A remarkably effective hack, but a hack nonetheless.

We wanted models that understand language. We got models that understand statistically-frequent byte sequences. BPE doesn't know that un- means negation. It just noticed the pattern appears often enough to merit its own token. The fact that this correlates with morphological structure is convenient, not intentional.

And yet: it works. LLMs can write poetry, explain quantum mechanics, and debug your code, all while processing text through a compression algorithm from 1994. The gap between "frequency-based substring merging" and "understanding" turns out to be narrower than anyone expected.

Perhaps understanding was never what we thought it was.

So the next time an LLM confidently miscounts the letters in strawberry, remember: it never saw the word. It saw ["str", "aw", "berry"] and did its best.

We all are.