Why Non-English Speakers Pay More for AI

Same meaning, different price. The hidden cost of tokenization.

Type "I, for one, welcome our new insect overlords" into GPT-4.

12 (sub-word) tokens.

Type it in Japanese. 22 tokens.
Type it in Hindi. 56 tokens.
Type it in Tamil. 86 tokens.

Same groveling to our arthropod masters. Same model. Up to 7x the cost.

At first glance, this looks unfair. Same intent, same semantics, wildly different bills. But the tokenizer isn't measuring meaning. It's measuring how well your language was economically represented at training time.

Kent Brockman surrenders in 12 tokens. Tamil speakers need 86.

People kneeling before insect overlord
The cost of surrender depends on your alphabet.

Language      Tokens  Ratio  Text
English           12   1.0x  I, for one, welcome our new insect overlords.
German            17   1.4x  Ich für meinen Teil begrüße unsere neuen Insekten-Oberherren.
Dutch             17   1.4x  Ik, voor mijn part, verwelkom onze nieuwe insectenheersers.
Indonesian        18   1.5x  Saya, untuk satu, menyambut penguasa serangga baru kita.
French            19   1.6x  Pour ma part, je souhaite la bienvenue à nos nouveaux maîtres insectes.
Spanish           19   1.6x  Yo, por mi parte, doy la bienvenida a nuestros nuevos señores insectos.
Czech             19   1.6x  Já osobně vítám naše nové hmyzí pány.
Norwegian         20   1.7x  Jeg, for min del, ønsker våre nye insektherskere velkommen.
Swedish           20   1.7x  Jag, för min del, välkomnar våra nya insektshärskare.
Italian           20   1.7x  Io, per primo, do il benvenuto ai nostri nuovi signori insetti.
Danish            20   1.7x  Jeg, for min del, byder vores nye insektherskere velkommen.
Polish            21   1.8x  Ja, ze swojej strony, witam naszych nowych władców owadów.
Turkish           22   1.8x  Ben, kendi adıma, yeni böcek efendilerimizi karşılıyorum.
Japanese          22   1.8x  私は新しい昆虫の支配者たちを歓迎します。
Portuguese        23   1.9x  Eu, por minha parte, dou as boas-vindas aos nossos novos senhores insetos.
Hungarian         23   1.9x  Én a magam részéről üdvözlöm új rovar urainkat.
Chinese           23   1.9x  我,作为其中一员,欢迎我们新的昆虫霸主。
Romanian          24   2.0x  Eu, personal, îi întâmpin pe noii noștri stăpâni insecte.
Finnish           25   2.1x  Minä puolestani toivotan uudet hyönteisherramme tervetulleiksi.
Russian           28   2.3x  Я, например, приветствую наших новых повелителей-насекомых.
Ukrainian         28   2.3x  Я, наприклад, вітаю наших нових повелителів-комах.
Swahili           29   2.4x  Mimi, kwa upande wangu, nawakaribisha watawala wetu wapya wa wadudu.
Arabic            30   2.5x  أنا، من جهتي، أرحب بسادتنا الحشرات الجدد.
Korean            32   2.7x  저는 개인적으로 우리의 새로운 곤충 지배자들을 환영합니다.
Vietnamese        34   2.8x  Tôi, về phần mình, chào đón những chúa tể côn trùng mới của chúng ta.
Thai              44   3.7x  ผม ในส่วนตัว ยินดีต้อนรับเจ้านายแมลงคนใหม่ของเรา
Hebrew            46   3.8x  אני, מצדי, מקבל בברכה את אדוני החרקים החדשים שלנו.
Greek             54   4.5x  Εγώ, προσωπικά, καλωσορίζω τους νέους μας κυρίαρχους έντομα.
Hindi             56   4.7x  मैं, अपनी ओर से, हमारे नए कीट स्वामियों का स्वागत करता हूं।
Bengali           75   6.2x  আমি, আমার পক্ষ থেকে, আমাদের নতুন পোকা প্রভুদের স্বাগত জানাই।
Tamil             86   7.2x  நான், எனது பங்கிற்கு, எங்கள் புதிய பூச்சி அதிபதிகளை வரவேற்கிறேன்.

That outcome follows directly from history.

Byte Pair Encoding began as a compression algorithm in the 1990s. Engineers designed it to shrink arbitrary data by repeatedly merging the most frequent byte sequences. It did not emerge from linguistics. It emerged from pragmatism. When modern language models adopted BPE, they inherited that logic intact.
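To make the mechanics concrete, here is a minimal sketch of that merge loop, assuming a toy corpus of whitespace-separated words. Production tokenizers such as tiktoken operate on raw bytes over vastly larger corpora, but the core loop is the same idea.

```python
# Toy BPE vocabulary learning: repeatedly merge the most frequent adjacent pair.
# Illustrative sketch only; production tokenizers work on bytes at far larger scale.
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Each word starts as a tuple of characters, weighted by how often it occurs.
    word_freqs = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # the most frequent pair wins
        merges.append(best)
        # Rewrite every word with the winning pair fused into a single symbol.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_freqs[tuple(out)] += freq
        word_freqs = new_freqs
    return merges

# Frequent sequences ("th", "ng") earn merges early; rare ones never do.
corpus = "the thing the king sings the song the ring".split()
print(learn_bpe_merges(corpus, 6))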

Try this out as an interactive demo

Frequency determines value.

Substrings that appear often in the training data become single, cheap tokens. Substrings that appear less often remain fragmented into smaller pieces. English benefits because English dominates the corpus. Japanese fragments because the tokenizer encountered its patterns less often during training.

The tokenizer does not struggle with Japanese. It simply stores Japanese inefficiently.

That inefficiency shows up as cost.

Type "Hello, how are you?" in Tamil. Twenty-three tokens.

Languages the tokenizer saw less often get encoded less efficiently. And the difference isn't small.

Tamil speakers pay 4x what English speakers pay for the same information.
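You can reproduce counts like these yourself with OpenAI's open-source tiktoken library. Below is a quick sketch using the cl100k_base encoding (the one GPT-4 uses); exact counts can drift slightly between encoding versions, but the ranking holds.

```python
# Count tokens for the same sentence across languages using tiktoken (cl100k_base).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "I, for one, welcome our new insect overlords.",
    "Japanese": "私は新しい昆虫の支配者たちを歓迎します。",
    "Hindi": "मैं, अपनी ओर से, हमारे नए कीट स्वामियों का स्वागत करता हूं।",
    "Tamil": "நான், எனது பங்கிற்கு, எங்கள் புதிய பூச்சி அதிபதிகளை வரவேற்கிறேன்.",
}

baseline = len(enc.encode(samples["English"]))
for language, text in samples.items():
    n = len(enc.encode(text))
    print(f"{language:10s} {n:3d} tokens  ({n / baseline:.1f}x English)")
```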

Token counts drive pricing, context limits, and truncation behavior. When Japanese text expands into more tokens, fewer ideas fit into the same context window. Long prompts collapse sooner. Retrieval pipelines return less semantic content per request. None of this looks dramatic in isolation. Together, it reshapes what multilingual users can afford to do.

This behavior does not reflect a mistake. It reflects a design tradeoff. Engineers optimized tokenization for scale, speed, and statistical coverage, not for linguistic equity. A compression algorithm rewards what it sees most often. It always has.

In English corpora, a very small set of function words accounts for an outsized share of all tokens:

Depending on corpus and counting method, the top 50 to 100 words in English often cover 45-60% of all word occurrences in running text. Most of those words are function words.
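That concentration is easy to verify on any English text you have on hand. Here is a minimal sketch; the corpus.txt path and the top-100 cutoff are placeholders, not a canonical benchmark.

```python
# Measure how much of running text the k most frequent words cover.
import re
from collections import Counter

def top_k_coverage(text, k=100):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / len(words)

# "corpus.txt" is a placeholder for any large English text file.
with open("corpus.txt", encoding="utf-8") as f:
    share = top_k_coverage(f.read(), k=100)
print(f"Top-100 words cover {share:.0%} of all word occurrences")
```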

Language    Type            Content Words (Lexicon)   Function Words (Lexicon)   Content Words (Usage)   Function Words (Usage)
English     Analytic        ~99.7% (~170K words)      ~0.3% (~300 words)         ~45%                    ~55%
Chinese     Isolating       ~99.5% (~50K words)       ~0.5% (~250 words)         ~55%                    ~45%
Japanese    Agglutinative   ~99.8% (~50K words)       ~0.2% (~100 words)         ~60%                    ~40%
Russian     Fusional        ~99.5% (~150K words)      ~0.5% (~400 words)         ~65%                    ~35%
Arabic      Fusional        ~99.6% (~60K words)       ~0.4% (~200 words)         ~70%                    ~30%
Korean      Agglutinative   ~99.8% (~50K words)       ~0.2% (~80 words)          ~75%                    ~25%
Turkish     Agglutinative   ~99.9% (~100K words)      ~0.1% (~50 words)          ~80%                    ~20%
Hungarian   Agglutinative   ~99.9% (~80K words)       ~0.1% (~40 words)          ~82%                    ~18%
Finnish     Agglutinative   ~99.9% (~90K words)       ~0.1% (~40 words)          ~85%                    ~15%
Swahili     Agglutinative   ~99.9% (~50K words)       ~0.1% (~30 words)          ~88%                    ~12%
Inuktitut   Polysynthetic   ~99.95% (~10K roots)      ~0.05% (~20 words)         ~95%                    ~5%

English concentrates usage into a few function words.

Chart: Function Words in Actual Usage, sorted high to low, from English at 55% down to Inuktitut at 5% (the Function Words (Usage) column from the table above).

BPE does not care what a string means.

It only cares how often it repeats.

English repeats for two reasons that compound:

  1. Corpus dominance
    English appears far more often than any other language in LLM training data. That guarantees its surface forms get seen orders of magnitude more times.
  2. Internal repetition structure
    Within English, a small set of function words accounts for roughly half of all usage. Those words repeat constantly and with minimal variation.

Corpus dominance determines which language benefits.

Function-word concentration determines how much it benefits.

Neither alone is sufficient.

Linguists classify languages by how they package meaning into words. Some languages use many small, separate words. Others pack entire sentences into single, complex words. This structural difference has profound implications for how well BPE tokenization can compress text.

The four major types:

  1. Analytic (e.g., English)
    Meaning is carried by many short, separate words and by word order.
  2. Fusional (e.g., Russian, Arabic)
    Single endings fuse several grammatical meanings into one affix.
  3. Agglutinative (e.g., Turkish, Finnish, Korean)
    Words are built by chaining many distinct suffixes onto a root.
  4. Polysynthetic (e.g., Inuktitut)
    Entire sentences can be packed into single, very long words.

Why this matters for tokenization:

BPE learns to compress text by finding repeated byte sequences. Languages with many short, stable, frequently-repeated words give BPE lots of reusable patterns. Languages that encode meaning in long, productive word forms produce fewer exact repetitions and hit a compression ceiling that no amount of training data can overcome.

The charts below show five properties that BPE exploits for compression. Notice how the polygon shrinks from Analytic → Polysynthetic. The smaller the polygon, the harder it is for BPE to achieve efficient tokenization regardless of how much training data exists.

Corpus dominance lowers cost; morphology sets the floor.

Token cost is not a property of meaning. It emerges from repetition, exposure, and structure. Corpus dominance decides which language benefits. Morphology decides how far that benefit can go.

Try It Yourself: The Compression Ceiling Explorer

The charts above show static snapshots, but the real insight comes from seeing how these properties change (or don't) as training data grows.

The interactive demo below lets you experiment with two variables:

  1. Language type: Select Analytic (English), Agglutinative (Turkish), Fusional (Russian), or Polysynthetic (Inuktitut)
  2. Corpus dominance: Slide from 1% to 100% to simulate what happens as a language gets more representation in training data

Watch what happens as you drag the slider. For English, the polygon expands dramatically, because more data means more compression. But for Polysynthetic languages, something different happens: the polygon grows, then hits a wall. Even at 100% corpus dominance, it can't reach the compression levels that English achieves at 50%.

This is the ceiling in action. Three of the five axes (Word Boundary Clarity, Function Word Frequency, Surface Form Stability) are fixed because they're determined by grammar, not data. Only two axes (Byte Reuse Potential, Compression Achieved) respond to more training data.

Can you make Polysynthetic beat Analytic?

Not likely. English not only has corpus dominance; it also has the highest structural ceiling. Even in a hypothetical world where Inuktitut dominated the training corpus, it would still tokenize less efficiently than English does today.

It's not just about how much data exists. It's about what compression is possible.

English wins twice: once from corpus dominance (it has the most training data), and again from structural advantage (its grammar creates the most compressible patterns). Other languages can close the first gap with more data. They cannot close the second.

The ceiling is baked into the grammar itself.

The Compounding Costs

Direct Financial Cost

At GPT-4o input pricing ($2.50 per million tokens):

Language   Tokens / 1,000 words   Cost       Ratio
English    ~1,300                 $0.00325   1.0x
Spanish    ~1,800                 $0.00450   1.4x
Russian    ~3,000                 $0.00750   2.3x
Arabic     ~3,250                 $0.00813   2.5x
Tamil      ~9,400                 $0.02350   7.2x

For a company processing millions of customer queries in Tamil, the infrastructure cost is 7x that of an English-only competitor.
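The arithmetic behind that table is simple. Here is a quick sketch using the $2.50-per-million rate and the tokens-per-1,000-words estimates above; treat the figures as rough averages, not measurements of your workload.

```python
# Per-language cost of sending 1,000 words, at $2.50 per million input tokens.
PRICE_PER_MILLION_TOKENS = 2.50

tokens_per_1000_words = {
    "English": 1_300,
    "Spanish": 1_800,
    "Russian": 3_000,
    "Arabic": 3_250,
    "Tamil": 9_400,
}

english_cost = tokens_per_1000_words["English"] / 1_000_000 * PRICE_PER_MILLION_TOKENS
for language, tokens in tokens_per_1000_words.items():
    cost = tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    print(f"{language:8s} ${cost:.5f} per 1,000 words  ({cost / english_cost:.1f}x)")
```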

Context Window Tax

GPT-4's 128K context window sounds huge. But that limit is counted in tokens, not words.

If you're working in Tamil, that 128K window holds roughly 14K words of content.

Globe made of puzzle pieces with one hemisphere fragmented
The pieces are all there. Some just cost more.

In English, the same window holds ~100K words.

Same advertised limit. One-seventh the actual capacity.
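The same per-word ratios explain the capacity gap. A rough sketch, assuming a 128K-token window and the ~1.3 and ~9.4 tokens-per-word figures implied by the pricing table above:

```python
# Approximate how many words of content fit in a 128K-token context window.
CONTEXT_TOKENS = 128_000

tokens_per_word = {
    "English": 1.3,  # ~1,300 tokens per 1,000 words
    "Tamil": 9.4,    # ~9,400 tokens per 1,000 words
}

for language, ratio in tokens_per_word.items():
    words = CONTEXT_TOKENS / ratio
    print(f"{language}: ~{words:,.0f} words fit in the advertised 128K window")
```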

Quality Degradation

Fragmented tokens don't just cost more. They may perform worse.

When a word is split into pieces, the model must:

  1. Recognize the pieces as belonging together
  2. Compose meaning across token boundaries
  3. Handle the increased sequence length

Research suggests models perform better on languages with more efficient tokenization. The tokenizer isn't neutral infrastructure. It's a performance bottleneck.

Who Pays Most?

Winners (1.0-2.0x):
English, followed by the European languages near the top of the table (German, Dutch, French, Spanish, Italian, Polish), plus Turkish, Japanese, and Chinese.

Moderate tax (2-3x):
Russian, Ukrainian, Swahili, Arabic, Korean, and Vietnamese.

Heavy tax (3-5x+):
Thai, Hebrew, Greek, Hindi, Bengali, and Tamil, which pays 7.2x.

GPT-4o doubled the tokenizer vocabulary (~200K tokens vs. ~100K). That helps somewhat: more vocabulary entries can be devoted to non-English text.

Some models train their tokenizers on more balanced multilingual corpora. These narrow the gap but don't eliminate it.

Language-Specific Models

For high-stakes applications, dedicated models exist. These optimize tokenization for their target language but sacrifice English performance.


References

1. Petrov, A., et al. (2023). "Language Model Tokenizers Introduce Unfairness Between Languages." arXiv.

2. Ahia, O., et al. (2023). "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models." EMNLP.

3. Rust, P., et al. (2021). "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models." ACL.

4. OpenAI. (2023). tiktoken. GitHub.

5. Conneau, A., et al. (2020). "Unsupervised Cross-lingual Representation Learning at Scale." ACL.