The 30-Year Journey of an Algorithm That Accidentally Learned to Read
In 1994, Philip Gage published a short paper in the C Users Journal titled "A New Algorithm for Data Compression." It described a simple technique for shrinking files by replacing frequently occurring byte pairs with single bytes.
Thirty years later, that same algorithm (Byte Pair Encoding) powers every major large language model: GPT-4, Claude, LLaMA, Mistral. It's how ChatGPT reads your prompts. It's why you pay per token, not per word.
This is the story of how a compression hack became the foundation of modern AI.
Part 1: The Compression Era (1994-2015)
Philip Gage wasn't thinking about language. He was thinking about bytes.
His problem was practical: how do you compress arbitrary binary data efficiently? The dominant algorithms of the time (Huffman coding, LZW used in GIF files, and arithmetic coding) were sophisticated but complex.
Gage wanted something simpler.
His insight: if two bytes frequently appear next to each other, replace them with a single unused byte.
The algorithm:
- Scan the data for the most frequent pair of adjacent bytes
- Replace all occurrences of that pair with a new byte not in the data
- Record the substitution in a table
- Repeat until no pair occurs more than once (or you run out of unused bytes)
That's it. No probability models. No complex data structures.
Just counting and replacing.
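In Python rather than Gage's C, the whole loop fits in a couple of dozen lines. This is a minimal sketch for illustration (it ignores Gage's block handling and header format, and assumes the input never uses all 256 byte values):

from collections import Counter

def bpe_compress(data: bytes):
    """Repeatedly replace the most frequent adjacent byte pair with an unused byte."""
    data = list(data)
    pairs_table = {}                                  # new code byte -> (left, right)
    unused = [b for b in range(256) if b not in set(data)]
    while unused:
        counts = Counter(zip(data, data[1:]))         # count adjacent byte pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:                                  # no pair occurs more than once
            break
        code = unused.pop()
        pairs_table[code] = pair
        out, i = [], 0
        while i < len(data):                          # greedy left-to-right replacement
            if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                out.append(code)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
    return bytes(data), pairs_table

compressed, table = bpe_compress(b"AAABDAAABAC")
print(len(compressed), len(table))                    # 5 bytes of data, 3 substitutions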
The Actual Mechanics
A byte can represent 256 values (0-255). In any given block of data, not all 256 values are present. ASCII text, for example, typically uses only 70-100 distinct byte values (letters, digits, punctuation, whitespace). The remaining 150+ byte values are "unused" and available as substitution codes.
Gage's algorithm tracks this with two arrays:
unsigned char leftcode[256];
unsigned char rightcode[256];

// Initialize: every byte maps to itself (literal)
int c;
for (c = 0; c < 256; c++) {
    leftcode[c] = c;
    rightcode[c] = 0;
}
The sentinel value leftcode[c] == c means "byte c is a literal." When you create a substitution, you pick an unused byte and set leftcode[unused] and rightcode[unused] to the pair being replaced.
A Worked Example
Suppose your data block contains only the bytes: A, B, C, D (ASCII values 65, 66, 67, 68).
That means bytes 0-64, 69-255 are all unused and available as substitution codes.
Input: AAABDAAABAC
Pass 1: Most frequent pair is "AA" (appears 4 times).
- Find an unused byte. Let's say byte 128 (0x80) isn't in the data.
- Set leftcode[128] = 'A', rightcode[128] = 'A'
- Replace all "AA" with byte 128:
Result: [128]ABD[128]ABAC
Pass 2: Most frequent pair is now [128]A (appears 2 times).
- Pick another unused byte, say 129.
- Set leftcode[129] = 128, rightcode[129] = 'A'
- Replace:
Result: [129]BD[129]BAC
Pass 3: Most frequent pair is now [129]B (appears 2 times).
- Pick byte 130, set leftcode[130] = 129, rightcode[130] = 'B'
- Replace:
Result: [130]D[130]AC
Pass 4: No pair appears more than once. Stop.
Original: 11 bytes → Compressed: 5 bytes + lookup table
To decompress, the algorithm uses a stack. When it encounters byte 129, it looks up leftcode[129] and rightcode[129], pushes both onto a stack, and keeps expanding until it hits literals (bytes where leftcode[c] == c).
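The same idea in Python, as a sketch that reuses the pairs_table from the compression sketch above instead of Gage's leftcode/rightcode arrays:

def bpe_decompress(data: bytes, pairs_table: dict) -> bytes:
    """Expand each code via a stack until only literal bytes remain."""
    out = []
    for byte in data:
        stack = [byte]
        while stack:
            b = stack.pop()
            if b in pairs_table:              # substitution code: expand it
                left, right = pairs_table[b]
                stack.append(right)           # push right first so left comes out first
                stack.append(left)
            else:                             # literal byte
                out.append(b)
    return bytes(out)

data, table = bpe_compress(b"AAABDAAABAC")
print(bpe_decompress(data, table))            # b'AAABDAAABAC'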
The Exponential Power
Here's the clever part: pair codes can reference other pair codes.
Byte 129 expands to [128, A], and 128 expands to [A, A]. So 129 represents AAA in a single byte.
Gage noted: "1024 identical bytes can be reduced to a single byte after only ten pair substitutions."
That's 2^10 = 1024.
What Happens When All Bytes Are Used?
This is the natural limit of Gage's original BPE.
From the paper:
"The algorithm repeats this process until no further compression is possible, either because there are no more frequently occurring pairs or there are no more unused bytes to represent pairs."
and:
"If no compression can be performed, BPE passes the data through unchanged except for the addition of a few header bytes to each block of data."
This is why Gage processes data in blocks (default 5000 bytes). Smaller blocks are statistically likely to have many unused byte values. A block of ASCII text might use only 80 of 256 possible byte values, leaving 176 available for substitution codes.
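You can check that claim in a couple of lines (a throwaway illustration, not part of Gage's code):

block = ("The quick brown fox jumps over the lazy dog. " * 120).encode("ascii")[:5000]
used = set(block)
print(f"{len(used)} distinct byte values used; {256 - len(used)} free for substitution codes")

A repeated pangram is an extreme case; ordinary ASCII prose typically lands in the 70-100 range mentioned earlier, still leaving well over 150 free codes per block.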
Why It Worked
BPE had several advantages over more sophisticated algorithms:
- Simplicity: The code fit on a single page
- Speed: Linear time complexity per pass
- Adaptivity: No need to pre-specify patterns; it finds them
- Reversibility: Perfect reconstruction guaranteed
It wasn't the best compression algorithm. LZW and arithmetic coding could compress further.
But BPE was good enough and easy to implement.
Life on the Fringes (1994-2015)
BPE never gained mainstream adoption in compression tools.
7-Zip uses LZMA. WinZip and gzip use DEFLATE. bzip2 uses Burrows-Wheeler. These achieved better compression ratios; BPE's advantage was simplicity and a tiny memory footprint, not raw power.
Where did BPE actually live?
- Embedded systems with severe memory constraints (550 bytes for decompression!)
- Educational implementations in algorithms courses
- Academic papers comparing compression techniques
- Niche utilities where "good enough" mattered more than optimal
In 1999, Japanese researchers at Kyushu University and Kyushu Institute of Technology published "Byte Pair Encoding: A Text Compression Scheme That Accelerates Pattern Matching" (Shibata, Kida, Fukamachi, Takeda, Shinohara, Shinohara, and Arikawa).
They discovered something surprising: searching BPE-compressed text was 1.6-1.9x faster than searching the original. The compression reduced data size; the structured byte codes accelerated pattern matching.
However, even this was still about compression and search, rather than language understanding.
The Parallel Bridges (1999-2005)
BPE wasn't the only compression technique finding new applications. During this period, two distinct "bridges" emerged from compression research into adjacent fields: compression → queryability (compressed full-text indexes that can be searched directly) and compression → similarity (compression-based clustering and classification):
Figure: The two bridges from compression research (1994-2005) and their relationship to modern AI. Solid arrows show direct lineage; dotted arrows show conceptual influence or independent rediscovery. Note that NCD and embeddings share a philosophical goal (representation without manual features) but have no technical ancestry. The missing spine is language modeling: the insight that compression and prediction are mathematically equivalent.
Part 2: The NLP Problem (2000-2015)
While BPE was compressing files, natural language processing was hitting a wall.
The Vocabulary Explosion
Statistical NLP models require vocabularies, which are lists of known words. But how big should the vocabulary be?
- Too small: Common words only. "The", "a", "is" work fine. But "iPhone"? "COVID-19"? "Beyoncé"? All become <UNK> (unknown).
- Too large: Include every word ever written. The vocabulary explodes. English has 500,000+ dictionary words. Add proper nouns, technical terms, misspellings, and you're at millions. Each word needs an embedding vector (hundreds or thousands of parameters). Memory and computation blow up.
The field settled on an uncomfortable compromise: fixed vocabularies of 30,000-50,000 words, with everything else mapped to <UNK>.
This worked... poorly.
Imagine a sentiment analyzer encountering: "The new iPhone is amazign!"
Result: "The new <UNK> is <UNK>!"
The model has no idea what the user is talking about.
The Morphology Blindness
It got worse. Word-level vocabularies couldn't capture morphological structure.
run, runs, running, runner each got their own vocabulary slot, with no shared representation. The model couldn't learn that they're related.
For agglutinative languages like Turkish or Finnish, where words compound endlessly, the problem was catastrophic. A single Turkish word might encode an entire English sentence. No reasonable vocabulary could cover the combinatorial space.
The Character-Level Detour
One radical solution: forget words entirely. Use characters.
Vocabulary size drops to ~100-200 (letters, digits, punctuation). No out-of-vocabulary (OOV) problem, ever. Morphology becomes learnable. running is just r-u-n-n-i-n-g, and the model can see run as a prefix.
But character-level models struggled:
- Sequence explosion: A 500-word paragraph becomes 2,500+ characters
- Long-range dependencies: The model must learn that c-a-t spells "cat" across huge distances
- Computational cost: Attention is O(n²) in sequence length
The models worked for some tasks but couldn't match word-level performance on semantic understanding. Learning spelling, morphology, syntax, and semantics all from raw characters was too much to ask.
The field needed a middle ground.
Part 3: The Breakthrough (2015-2016)
The Lost Bridge
Here's what we don't know: how did Rico Sennrich find BPE?
The 2016 paper cites "Gage (1994)" without providing any narrative. No "I was reading old C Users Journal issues." No "a colleague mentioned this obscure algorithm."
The bridge between a 1994 compression paper in a C programming magazine and 2015 neural machine translation research at Edinburgh is undocumented.
Was it systematic literature mining? Institutional knowledge passed through a department? A chance encounter in an old textbook? We don't know. The discovery story itself is lost to history.
This matters for more than historical curiosity.
It's a reminder about academic contribution:
- Philip Gage finished his work. He didn't abandon a half-baked idea. He wrote complete C code, published detailed explanations, and made his algorithm reproducible.
- He cited his sources. Gage's paper carefully compared BPE to LZW and other algorithms, situating his work in context.
- And he made it accessible. The C Users Journal wasn't Nature, but it was indexed and findable.
Twenty-two years later, someone found it. Research that might have seemed "merely practical" (a niche compression trick for a niche audience) became the foundation of modern AI.
You never know who will need your work, or when.
Sennrich's Insight
In 2015, Rico Sennrich, Barry Haddow, and Alexandra Birch at the University of Edinburgh were working on neural machine translation. They faced the OOV problem acutely: technical terms, names, and rare words mangled their translations.
However they found BPE, they recognized something: what if you applied it to text?
Not bytes. Characters and character sequences.
The algorithm stayed identical:
1. Start with a vocabulary of individual characters
2. Count all adjacent pairs in the training corpus
3. Merge the most frequent pair into a new token
4. Repeat until you reach your target vocabulary size
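A compact sketch of that loop, on a toy word-frequency corpus in the spirit of the paper's own example (end-of-word markers omitted; the target of ten merges is arbitrary, while a real vocabulary uses tens of thousands):

from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, vocab):
    """Rewrite every word, replacing the pair with its concatenation."""
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Words as space-separated characters, with corpus counts.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(10):                     # final vocabulary = characters + 10 merged tokens
    counts = pair_counts(vocab)
    if not counts:
        break
    best = counts.most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges)                           # ('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w'), ...

Frequent character sequences such as es and est get merged early; a real run applies tens of thousands of merges over a large corpus.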
But the output was something new: a subword vocabulary.
Common words like "the" would get merged all the way to single tokens. Rare words would stay fragmented, assembled from smaller pieces:
Word Tokenization
------------------------------------------------------------
the ["the"]
running ["run", "ning"]
transformers ["transform", "ers"]
Pneumonoultramicroscopicsilicovolcanoconiosis ["P", "ne", "um", "ono", "ult", "ram", "ic", "ros", "cop", "ic", "s", "il", "ic", "ov", "ol", "can", "oc", "on", "i", "osis"]
The Paper
Their 2016 paper, "Neural Machine Translation of Rare Words with Subword Units," demonstrated dramatic improvements on translation tasks. Rare words that previously collapsed to <UNK> could now be represented as sequences of subword units, even if the model had never seen the full word before.
The key insight: BPE automatically identifies morpheme-like units.
The algorithm doesn't know linguistics. It doesn't know that un- is a negation prefix or -ing marks present participles. It just notices that these character sequences appear frequently across many different words.
Frequency, in language, correlates with meaning.
This accidental alignment between compression and linguistics made BPE perfect for NLP.
Part 4: The Scaling Era (2017-2020)
Transformers Enter
In 2017, "Attention Is All You Need" introduced the Transformer architecture. Suddenly, sequence models could be parallelized. Training scaled to unprecedented sizes.
But Transformers still needed tokenization. And BPE (simple, effective, scalable) became the default choice.
- GPT (2018): OpenAI's first GPT used BPE with a 40,000-token vocabulary.
- BERT (2018): Google chose WordPiece (a BPE variant using PMI instead of raw frequency), but the core logic was the same.
- GPT-2 (2019): Introduced Byte-Level BPE, starting from raw bytes (0-255) instead of characters. This guaranteed any Unicode text could be tokenized, including emoji, code, and corrupted data.
- GPT-3 (2020): 175 billion parameters, still using BPE. The vocabulary grew to ~50,000 tokens, but the algorithm remained Philip Gage's 1994 invention.
Part 5: The Variants and Evolutions
Byte-Level BPE (GPT-2+)
Original BPE started with characters. Byte-level BPE starts with raw bytes (0-255).
Why? Characters are encoding-dependent. UTF-8 represents "é" differently than Latin-1 does. Byte-level BPE sidesteps this: it sees the raw byte sequence, whatever the encoding.
Benefits:
- Truly universal: any byte sequence is tokenizable
- No special handling for Unicode, emoji, or binary data
- Consistent behavior across encodings
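The encoding point is easy to see directly (a tiny illustration, nothing model-specific):

print(list("é".encode("utf-8")))      # [195, 169]            two bytes in UTF-8
print(list("é".encode("latin-1")))    # [233]                 one byte in Latin-1
print(list("🙂".encode("utf-8")))     # [240, 159, 153, 130]  emoji are just more bytes

Byte-level BPE starts from those raw byte values, so every one of these strings is tokenizable without special cases.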
The GPT-2 and GPT-3 tokenizers use byte-level BPE with ~50,000 merge operations; GPT-4's tokenizer follows the same approach with a vocabulary of roughly 100,000 tokens.
WordPiece (BERT)
Google's variant changes the merge criterion. Instead of most frequent pair, merge the pair that maximizes:
score(a, b) = frequency(ab) / (frequency(a) × frequency(b))
This is Pointwise Mutual Information (PMI). It asks: "Do these tokens appear together more than chance predicts?"
- High PMI = "These belong together"
- Low PMI = "They're just both common"
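A toy illustration of that scoring rule with made-up counts (a sketch of the criterion, not Google's implementation):

unigram_freq = {"token": 1200, "##ization": 300, "the": 50000, "##re": 8000}
pair_freq = {("token", "##ization"): 250, ("the", "##re"): 900}

def wordpiece_score(a, b):
    """score(a, b) = frequency(ab) / (frequency(a) * frequency(b))"""
    return pair_freq[(a, b)] / (unigram_freq[a] * unigram_freq[b])

print(wordpiece_score("token", "##ization"))   # ~6.9e-04: rare pieces that nearly always co-occur
print(wordpiece_score("the", "##re"))          # ~2.3e-06: frequent pieces, weak association

Even though ("the", "##re") has the higher raw pair count, ("token", "##ization") scores roughly 300x higher, which is exactly the behavior the PMI criterion rewards.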
WordPiece uses ## to mark continuation tokens:
"tokenization" โ ["token", "##ization"]
Unigram (SentencePiece)
Unigram inverts the BPE approach:
- Start with a huge vocabulary (all substrings up to some length)
- Train a unigram language model
- Remove tokens that hurt model likelihood least
- Repeat until target vocabulary size
Unlike BPE, Unigram is probabilistic. The same text might have multiple valid tokenizations; the algorithm picks the most probable.
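A small sketch of the segmentation step, with a hand-picked vocabulary and made-up probabilities (training the unigram model itself is the part this omits):

import math

# Made-up subword probabilities; a real model learns these from data.
probs = {"token": 0.04, "ization": 0.01, "tok": 0.005, "en": 0.03, "iz": 0.004, "ation": 0.02}
probs.update({c: 0.001 for c in "abcdefghijklmnopqrstuvwxyz"})   # single-character fallback

def segment(word):
    """Pick the tokenization with the highest total log-probability (Viterbi over split points)."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][1] is not None:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(segment("tokenization"))   # ['token', 'ization'] beats e.g. ['tok', 'en', 'iz', 'ation']

Where BPE deterministically replays its learned merges, the unigram model scores every possible segmentation and keeps the most probable one.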
Part 6: Where BPE Lives Today
Model      Tokenizer       Vocabulary Size
------------------------------------------
GPT-2      BPE             50,257
GPT-3      BPE             50,257
GPT-4      BPE             100,277
BERT       WordPiece       30,522
RoBERTa    BPE             50,265
T5         SentencePiece   32,000
LLaMA      SentencePiece   32,000
LLaMA 2    SentencePiece   32,000
Claude     BPE             ~100,000
Mistral    SentencePiece   32,000
tiktoken: BPE at Scale
OpenAI's tiktoken is BPE implemented in Rust with Python bindings. Same algorithm, 3-6x faster than alternatives.
Why does speed matter? Every API call tokenizes. Every prompt. Every response. Millions per second globally. A 3x speedup saves enormous compute.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(tokens) # [9906, 11, 1917, 0]
print(len(tokens)) # 4
The Uncomfortable Truth
Byte Pair Encoding is a compression algorithm.
It doesn't understand language. It understands frequency distributions. It doesn't know grammar. It knows which character sequences appear together often.
And yet: it works.
GPT-4 writes poetry, explains quantum mechanics, and debugs code, all while processing text through an algorithm designed to shrink files in 1994.
The gap between "frequency-based substring merging" and "understanding" turns out to be narrower than anyone expected.
Or perhaps understanding was never what we thought it was.
Philip Gage probably didn't anticipate that his little compression trick would become the reading apparatus of artificial intelligence. He was just trying to make files smaller.
From compression curiosity to the foundation of modern AI.
Year    Event
------------------------------------------------------------
1994    Philip Gage publishes BPE in C Users Journal
1999    Japanese researchers show BPE accelerates pattern matching
2015    Sennrich applies BPE to neural machine translation
2016    "Neural Machine Translation of Rare Words" published
2017    Transformer architecture introduced
2018    GPT-1 uses BPE (40K vocabulary)
2018    BERT uses WordPiece (BPE variant)
2019    GPT-2 introduces byte-level BPE
2020    GPT-3 scales to 175B parameters, still BPE
2023    GPT-4, Claude, LLaMA all use BPE variants
Sometimes the most important innovations are accidents.
References
Core BPE Papers
Gage, P. (1994). "A New Algorithm for Data Compression." The C Users Journal, Vol. 12, No. 2.
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., & Arikawa, S. (1999). "Byte Pair Encoding: A Text Compression Scheme That Accelerates Pattern Matching." Technical Report DOI-TR-CS-161, Kyushu University.
Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin.
Bridge A: Compression → Queryability
Ferragina, P. & Manzini, G. (2000). "Opportunistic Data Structures with Applications." FOCS 2000. (FM-Index introduction)
Navarro, G. & Mäkinen, V. (2007). "Compressed Full-Text Indexes." ACM Computing Surveys.
Bridge B: Compression → Similarity
Thaper, N. (2001). "A Source-Based Approach to Text Classification." MIT Master's Thesis.
Cilibrasi, R. & Vitányi, P. (2005). "Clustering by Compression." IEEE Transactions on Information Theory.
Marton, Y., Wu, N., & Hellerstein, L. (2005). "On Compression-Based Text Classification." ECIR 2005.
Transformers and Modern AI
Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS.
Radford, A., et al. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI.
Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI.
OpenAI. (2023). tiktoken. GitHub.