Why Non-English Speakers Pay More for AI
Same meaning, different price. The hidden cost of tokenization.
Type "I, for one, welcome our new insect overlords" into GPT-4.
12 (sub-word) tokens.
Type it in Japanese. 22 tokens.
Type it in Hindi. 56 tokens.
Type it in Tamil. 86 tokens.
Same groveling to our arthropod masters. Same model. Up to 7x the cost.
At first glance, this looks unfair. Same intent, same semantics, wildly different bills. But the tokenizer isn't measuring meaning. It's measuring how well your language was economically represented at training time.
Kent Brockman surrenders in 12 tokens. Tamil speakers need 86.
| Language | Tokens | Ratio | Text |
|---|---|---|---|
| English | 12 | 1.0x | I, for one, welcome our new insect overlords. |
| German | 17 | 1.4x | Ich für meinen Teil begrüße unsere neuen Insekten-Oberherren. |
| Dutch | 17 | 1.4x | Ik, voor mijn part, verwelkom onze nieuwe insectenheersers. |
| Indonesian | 18 | 1.5x | Saya, untuk satu, menyambut penguasa serangga baru kita. |
| French | 19 | 1.6x | Pour ma part, je souhaite la bienvenue à nos nouveaux maîtres insectes. |
| Spanish | 19 | 1.6x | Yo, por mi parte, doy la bienvenida a nuestros nuevos señores insectos. |
| Czech | 19 | 1.6x | Já osobně vítám naše nové hmyzí pány. |
| Norwegian | 20 | 1.7x | Jeg, for min del, ønsker våre nye insektherskere velkommen. |
| Swedish | 20 | 1.7x | Jag, för min del, välkomnar våra nya insektshärskare. |
| Italian | 20 | 1.7x | Io, per primo, do il benvenuto ai nostri nuovi signori insetti. |
| Danish | 20 | 1.7x | Jeg, for min del, byder vores nye insektherskere velkommen. |
| Polish | 21 | 1.8x | Ja, ze swojej strony, witam naszych nowych władców owadów. |
| Turkish | 22 | 1.8x | Ben, kendi adıma, yeni böcek efendilerimizi karşılıyorum. |
| Japanese | 22 | 1.8x | 私は新しい昆虫の支配者たちを歓迎します。 |
| Portuguese | 23 | 1.9x | Eu, por minha parte, dou as boas-vindas aos nossos novos senhores insetos. |
| Hungarian | 23 | 1.9x | Én a magam részéről üdvözlöm új rovar urainkat. |
| Chinese | 23 | 1.9x | 我,作为其中一员,欢迎我们新的昆虫霸主。 |
| Romanian | 24 | 2.0x | Eu, personal, îi întâmpin pe noii noștri stăpâni insecte. |
| Finnish | 25 | 2.1x | Minä puolestani toivotan uudet hyönteisherramme tervetulleiksi. |
| Russian | 28 | 2.3x | Я, например, приветствую наших новых повелителей-насекомых. |
| Ukrainian | 28 | 2.3x | Я, наприклад, вітаю наших нових повелителів-комах. |
| Swahili | 29 | 2.4x | Mimi, kwa upande wangu, nawakaribisha watawala wetu wapya wa wadudu. |
| Arabic | 30 | 2.5x | أنا، من جهتي، أرحب بسادتنا الحشرات الجدد. |
| Korean | 32 | 2.7x | 저는 개인적으로 우리의 새로운 곤충 지배자들을 환영합니다. |
| Vietnamese | 34 | 2.8x | Tôi, về phần mình, chào đón những chúa tể côn trùng mới của chúng ta. |
| Thai | 44 | 3.7x | ผม ในส่วนตัว ยินดีต้อนรับเจ้านายแมลงคนใหม่ของเรา |
| Hebrew | 46 | 3.8x | אני, מצדי, מקבל בברכה את אדוני החרקים החדשים שלנו. |
| Greek | 54 | 4.5x | Εγώ, προσωπικά, καλωσορίζω τους νέους μας κυρίαρχους έντομα. |
| Hindi | 56 | 4.7x | मैं, अपनी ओर से, हमारे नए कीट स्वामियों का स्वागत करता हूं। |
| Bengali | 75 | 6.2x | আমি, আমার পক্ষ থেকে, আমাদের নতুন পোকা প্রভুদের স্বাগত জানাই। |
| Tamil | 86 | 7.2x | நான், எனது பங்கிற்கு, எங்கள் புதிய பூச்சி அதிபதிகளை வரவேற்கிறேன். |
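You can reproduce these counts with OpenAI's tiktoken library (listed in the references). A minimal sketch, assuming the cl100k_base encoding used by GPT-4; exact counts can drift slightly across tokenizer versions:

```python
# Count tokens for the same sentence across languages with OpenAI's tiktoken.
# cl100k_base is the GPT-4 encoding; counts may differ slightly from the
# table depending on tokenizer version and punctuation.
import tiktoken

SENTENCES = {
    "English": "I, for one, welcome our new insect overlords.",
    "Japanese": "私は新しい昆虫の支配者たちを歓迎します。",
    "Hindi": "मैं, अपनी ओर से, हमारे नए कीट स्वामियों का स्वागत करता हूं।",
    "Tamil": "நான், எனது பங்கிற்கு, எங்கள் புதிய பூச்சி அதிபதிகளை வரவேற்கிறேன்.",
}

enc = tiktoken.get_encoding("cl100k_base")
baseline = len(enc.encode(SENTENCES["English"]))

for language, text in SENTENCES.items():
    n = len(enc.encode(text))
    print(f"{language:<10} {n:>3} tokens  ({n / baseline:.1f}x English)")
```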
That outcome follows directly from history.
Byte Pair Encoding began as a compression algorithm in the 1990s. Engineers designed it to shrink arbitrary data by repeatedly merging the most frequent byte sequences. It did not emerge from linguistics. It emerged from pragmatism. When modern language models adopted BPE, they inherited that logic intact.
Frequency determines value.
Substrings that appear often in the training data become single, cheap tokens. Substrings that appear less often remain fragmented into smaller pieces. English benefits because English dominates the corpus. Japanese fragments because the tokenizer encountered its patterns less often during training.
The tokenizer does not struggle with Japanese. It simply stores Japanese inefficiently.
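To see why frequency is everything, here is a toy sketch of the BPE training loop (not any production tokenizer). It repeatedly merges the most frequent adjacent pair of symbols, so whatever repeats most in the training text ends up as a single cheap token:

```python
# Toy BPE training loop: repeatedly merge the most frequent adjacent pair.
# Illustrative only; real tokenizers (tiktoken, SentencePiece) work on bytes,
# pre-tokenize on whitespace, and learn vocabularies of 100K+ entries.
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    symbols = list(text)          # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                 # nothing repeats anymore; no more cheap tokens
        merges.append((a, b))
        merged, i = [], 0
        while i < len(symbols):   # replace every occurrence of the pair
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges

# Substrings that repeat often ("the", "ing") tend to become single symbols
# within a few merges; rare substrings stay fragmented at the character level.
print(train_bpe("the cat and the dog sing the song they were singing", 10))
```

The only signal in that loop is the pair counter. Nothing in it knows, or needs to know, which language the characters came from.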
That inefficiency shows up as cost.
Type "Hello, how are you?" in Tamil. Twenty-three tokens.
Languages the tokenizer saw less frequently get understood less efficiently. And the difference isn't small.
Tamil speakers pay 4x what English speakers pay for the same information.
Token counts drive pricing, context limits, and truncation behavior. When Japanese text expands into more tokens, fewer ideas fit into the same context window. Long prompts collapse sooner. Retrieval pipelines return less semantic content per request. None of this looks dramatic in isolation. Together, it reshapes what multilingual users can afford to do.
This behavior does not reflect a mistake. It reflects a design tradeoff. Engineers optimized tokenization for scale, speed, and statistical coverage, not for linguistic equity. A compression algorithm rewards what it sees most often. It always has.
In English corpora, a very small set of function words accounts for an outsized share of all tokens:
- Articles: the, a
- Prepositions: of, to, in, for
- Pronouns: I, you, we, it
- Auxiliaries: is, are, was, have
- Conjunctions: and, but, or
- Particles and markers: not, that
Depending on corpus and counting method, the top 50 to 100 words in English often cover 45-60% of all word occurrences in running text. Most of those words are function words.
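If you want to sanity-check that figure on your own corpus, here is a rough sketch. The filename corpus.txt is a placeholder, and the crude regex tokenization is one reason the published figures are a range rather than a constant:

```python
# Measure what share of running text the top-N most frequent words cover.
import re
from collections import Counter

def top_n_coverage(text: str, n: int = 100) -> float:
    words = re.findall(r"[a-z']+", text.lower())   # crude word tokenization
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(words)

# corpus.txt is a placeholder: point it at any large English text file.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

print(f"Top-100 words cover {top_n_coverage(text):.1%} of word occurrences")
```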
| Language | Type | Content Words (Lexicon) | Function Words (Lexicon) | Content Words (Usage) | Function Words (Usage) |
|---|---|---|---|---|---|
| English | Analytic | ~99.7% (~170K words) | ~0.3% (~300 words) | ~45% | ~55% |
| Chinese | Isolating | ~99.5% (~50K words) | ~0.5% (~250 words) | ~55% | ~45% |
| Japanese | Agglutinative | ~99.8% (~50K words) | ~0.2% (~100 words) | ~60% | ~40% |
| Russian | Fusional | ~99.5% (~150K words) | ~0.5% (~400 words) | ~65% | ~35% |
| Arabic | Fusional | ~99.6% (~60K words) | ~0.4% (~200 words) | ~70% | ~30% |
| Korean | Agglutinative | ~99.8% (~50K words) | ~0.2% (~80 words) | ~75% | ~25% |
| Turkish | Agglutinative | ~99.9% (~100K words) | ~0.1% (~50 words) | ~80% | ~20% |
| Hungarian | Agglutinative | ~99.9% (~80K words) | ~0.1% (~40 words) | ~82% | ~18% |
| Finnish | Agglutinative | ~99.9% (~90K words) | ~0.1% (~40 words) | ~85% | ~15% |
| Swahili | Agglutinative | ~99.9% (~50K words) | ~0.1% (~30 words) | ~88% | ~12% |
| Inuktitut | Polysynthetic | ~99.95% (~10K roots) | ~0.05% (~20 words) | ~95% | ~5% |
English concentrates usage into a few function words.
BPE does not care what it compresses.
It only cares what repeats.
English repeats for two reasons that compound:
- Corpus dominance: English appears far more often than any other language in LLM training data. That guarantees its surface forms get seen orders of magnitude more times.
- Internal repetition structure: Within English, a small set of function words accounts for roughly half of all usage. Those words repeat constantly and with minimal variation.
Corpus dominance determines which language benefits.
Function-word concentration determines how much it benefits.
Neither alone is sufficient.
Linguists classify languages by how they package meaning into words. Some languages use many small, separate words. Others pack entire sentences into single, complex words. This structural difference has profound implications for how well BPE tokenization can compress text.
The four major types:
- Analytic: Grammar lives in separate function words: "the," "will," "of." Words stay short and stable. (English, Chinese, Vietnamese)
- Agglutinative: Grammar attaches as chains of suffixes. One word can encode subject, tense, mood, and more. (Turkish, Finnish, Korean, Swahili)
- Fusional: Single affixes encode multiple grammatical features at once, often irregularly. (Russian, Arabic, Spanish)
- Polysynthetic: Entire sentences compress into single words. Extreme morphological productivity. (Inuktitut, Mohawk, Yupik)
Why this matters for tokenization:
BPE learns to compress text by finding repeated byte sequences. Languages with many short, stable, frequently repeated words give BPE lots of reusable patterns. Languages that encode meaning in long, productive word forms produce fewer exact repetitions and hit a compression ceiling that no amount of training data can overcome.
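Here is a deliberately crude simulation of that difference. Every word in it is invented; the point is only that spreading grammar across separate, reusable function words yields a handful of forms that repeat constantly, while gluing the same markers onto roots multiplies the distinct surface forms a frequency-based compressor has to learn:

```python
# Render the same random "sentences" two ways: analytic style (grammar as
# separate words) and agglutinative style (grammar as suffixes glued onto the
# root). All roots and markers are invented for illustration.
import random
from collections import Counter

random.seed(0)
ROOTS = ["kal", "dom", "ser", "vin", "tor"]
MARKERS = ["pl", "past", "poss", "neg", "q"]      # toy grammatical markers

analytic, agglutinative = [], []
for _ in range(10_000):
    root = random.choice(ROOTS)
    marks = random.sample(MARKERS, k=random.randint(0, 3))
    analytic.extend([root] + marks)               # e.g. "kal past neg"
    agglutinative.append(root + "".join(marks))   # e.g. "kalpastneg"

for name, words in [("analytic", analytic), ("agglutinative", agglutinative)]:
    counts = Counter(words)
    top10 = sum(c for _, c in counts.most_common(10)) / len(words)
    print(f"{name:>14}: {len(counts):>4} distinct forms, "
          f"top 10 forms cover {top10:.0%} of occurrences")
```

In the analytic stream, ten strings each repeat thousands of times; in the agglutinative stream, the same grammar is spread across hundreds of rarer forms.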
The charts below show five properties that BPE exploits for compression. Notice how the polygon shrinks from Analytic → Polysynthetic. The smaller the polygon, the harder it is for BPE to achieve efficient tokenization regardless of how much training data exists.
Token cost is not a property of meaning. It emerges from repetition, exposure, and structure. Corpus dominance decides which language benefits. Morphology decides how far that benefit can go.
Try It Yourself: The Compression Ceiling Explorer
The charts above show static snapshots, but the real insight comes from seeing how these properties change (or don't) as training data grows.
The interactive demo below lets you experiment with two variables:
- Language type: Select Analytic (English), Agglutinative (Turkish), Fusional (Russian), or Polysynthetic (Inuktitut)
- Corpus dominance: Slide from 1% to 100% to simulate what happens as a language gets more representation in training data
Watch what happens as you drag the slider. For English, the polygon expands dramatically: more data means more compression. But for Polysynthetic languages, something different happens: the polygon grows, then hits a wall. Even at 100% corpus dominance, it can't reach the compression levels that English achieves at 50%.
This is the ceiling in action. Three of the five axes (Word Boundary Clarity, Function Word Frequency, Surface Form Stability) are fixed because they're determined by grammar, not data. Only two axes (Byte Reuse Potential, Compression Achieved) respond to more training data.
Can you make Polysynthetic beat Analytic?
Not likely. English not only has corpus dominance, it has the highest structural ceiling. Even in a hypothetical world where Inuktitut dominated the training corpus, it would still tokenize less efficiently than English does today.
English wins twice: once from corpus dominance (it has the most training data), and again from structural advantage (its grammar creates the most compressible patterns). Other languages can close the first gap with more data. They cannot close the second.
The ceiling is baked into the grammar itself.
The Compounding Costs
Direct Financial Cost
At $2.50 per million input tokens:
| Language | Tokens per 1,000 words | Cost per 1,000 words | Ratio |
|---|---|---|---|
| English | ~1,300 | $0.00325 | 1.0x |
| Spanish | ~1,800 | $0.00450 | 1.4x |
| Russian | ~3,000 | $0.00750 | 2.3x |
| Arabic | ~3,250 | $0.00813 | 2.5x |
| Tamil | ~9,400 | $0.02350 | 7.2x |
For a company processing millions of customer queries in Tamil, the inference cost is roughly 7x that of an English-only competitor.
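A back-of-the-envelope sketch of that gap, using the approximate tokens-per-1,000-words figures above and a hypothetical volume of ten million words of queries per month:

```python
# Monthly input cost at $2.50 per million tokens, using the approximate
# tokens-per-1,000-words figures from the table above.
PRICE_PER_TOKEN = 2.50 / 1_000_000
TOKENS_PER_1K_WORDS = {
    "English": 1_300, "Spanish": 1_800, "Russian": 3_000,
    "Arabic": 3_250, "Tamil": 9_400,
}

monthly_words = 10_000_000        # hypothetical volume: 10M words of queries
baseline = TOKENS_PER_1K_WORDS["English"] / 1_000 * monthly_words * PRICE_PER_TOKEN

for language, tokens_per_1k in TOKENS_PER_1K_WORDS.items():
    cost = tokens_per_1k / 1_000 * monthly_words * PRICE_PER_TOKEN
    print(f"{language:<8} ${cost:>7,.2f}/month  ({cost / baseline:.1f}x English)")
```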
Context Window Tax
GPT-4's 128K context window sounds huge. But that limit is measured in tokens, not words.
If you're working in Tamil, that 128K window holds roughly 14K words of content.
In English, the same window holds ~100K words.
Same advertised limit. One-seventh the usable capacity.
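The arithmetic behind those numbers fits in a few lines (tokens-per-word ratios taken from the cost table above):

```python
# Words of content that fit in a 128K-token window, given approximate
# tokens-per-word ratios (1,300 and 9,400 tokens per 1,000 words).
CONTEXT_TOKENS = 128_000

for language, tokens_per_word in [("English", 1.3), ("Tamil", 9.4)]:
    words = CONTEXT_TOKENS / tokens_per_word
    print(f"{language}: ~{words:,.0f} words fit in the 128K-token window")
```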
Quality Degradation
Fragmented tokenization doesn't just cost more. It can also degrade quality.
When a word is split into pieces, the model must:
- Recognize the pieces as belonging together
- Compose meaning across token boundaries
- Handle the increased sequence length
Research suggests models perform better on languages with more efficient tokenization. The tokenizer isn't neutral infrastructure. It's a performance bottleneck.
Who Pays Most?
Winners:
- English, German, French, Spanish
- Chinese (dedicated vocabulary space)
- Code (heavily represented in training)
Moderate tax (2-3x):
- Russian, Japanese, Korean
- Portuguese, Italian, Dutch
Heavy tax (3-5x+):
- Arabic, Hebrew, Persian
- Hindi, Tamil, Bengali, Telugu
- Thai, Vietnamese, Indonesian
- Swahili, Yoruba, Amharic
- Most languages spoken by <100M people
Mitigations
GPT-4o doubled vocabulary size (~200K tokens vs ~100K). This helps somewhat by allocating more of the vocabulary to non-English text.
Some models train tokenizers on balanced multilingual corpora:
- BLOOM: Explicitly balanced across 46 languages
- mT5: Trained on mC4, covering 101 languages
- XLM-RoBERTa: Designed for cross-lingual transfer
These narrow the gap but don't eliminate it.
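You can see the narrowing directly by running the same sentence through GPT-4's cl100k_base and through XLM-RoBERTa's multilingual SentencePiece vocabulary. A sketch assuming the tiktoken and Hugging Face transformers packages are installed; the counts are illustrative and depend on library versions:

```python
# Tokenize the same Tamil sentence with two different vocabularies:
# GPT-4's cl100k_base (tiktoken) and XLM-RoBERTa's multilingual
# SentencePiece model (Hugging Face transformers).
import tiktoken
from transformers import AutoTokenizer

text = "நான், எனது பங்கிற்கு, எங்கள் புதிய பூச்சி அதிபதிகளை வரவேற்கிறேன்."

cl100k = tiktoken.get_encoding("cl100k_base")
xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")

print("cl100k_base :", len(cl100k.encode(text)), "subword tokens")
print("xlm-roberta :", len(xlmr.tokenize(text)), "subword tokens")
```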
Language-Specific Models
For high-stakes applications, dedicated models exist:
- Japanese: rinna, ELYZA
- Chinese: ChatGLM, Qwen
- Arabic: Jais, AraGPT2
These optimize tokenization for their target language but sacrifice English performance.
References
1. Petrov, A., et al. (2023). "Language Model Tokenizers Introduce Unfairness Between Languages." arXiv.
2. Ahia, O., et al. (2023). "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models." EMNLP.
3. Rust, P., et al. (2021). "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models." ACL.
4. OpenAI. (2023). tiktoken. GitHub.
5. Conneau, A., et al. (2020). "Unsupervised Cross-lingual Representation Learning at Scale." ACL.