Why Non-English Speakers Pay More for AI

Same meaning, different price. The hidden cost of tokenization.

Type "I, for one, welcome our new insect overlords" into GPT-4.

12 (sub-word) tokens.

Type it in Japanese. 22 tokens.
Type it in Hindi. 56 tokens.
Type it in Tamil. 86 tokens.

Same groveling to our arthropod masters. Same model. Up to 7x the cost.

At first glance, this looks unfair. Same intent, same semantics, wildly different bills. But the tokenizer isn't measuring meaning. It's measuring how well your language was economically represented at training time.

Kent Brockman surrenders in 12 tokens. Tamil speakers need 86.

People kneeling before insect overlord
The cost of surrender depends on your alphabet.

Language      Tokens  Ratio  Text
English           12   1.0x  I, for one, welcome our new insect overlords.
German            17   1.4x  Ich für meinen Teil begrüße unsere neuen Insekten-Oberherren.
Dutch             17   1.4x  Ik, voor mijn part, verwelkom onze nieuwe insectenheersers.
Indonesian        18   1.5x  Saya, untuk satu, menyambut penguasa serangga baru kita.
French            19   1.6x  Pour ma part, je souhaite la bienvenue à nos nouveaux maîtres insectes.
Spanish           19   1.6x  Yo, por mi parte, doy la bienvenida a nuestros nuevos señores insectos.
Czech             19   1.6x  Já osobně vítám naše nové hmyzí pány.
Norwegian         20   1.7x  Jeg, for min del, ønsker våre nye insektherskere velkommen.
Swedish           20   1.7x  Jag, för min del, välkomnar våra nya insektshärskare.
Italian           20   1.7x  Io, per primo, do il benvenuto ai nostri nuovi signori insetti.
Danish            20   1.7x  Jeg, for min del, byder vores nye insektherskere velkommen.
Polish            21   1.8x  Ja, ze swojej strony, witam naszych nowych władców owadów.
Turkish           22   1.8x  Ben, kendi adıma, yeni böcek efendilerimizi karşılıyorum.
Japanese          22   1.8x  私は新しい昆虫の支配者たちを歓迎します。
Portuguese        23   1.9x  Eu, por minha parte, dou as boas-vindas aos nossos novos senhores insetos.
Hungarian         23   1.9x  Én a magam részéről üdvözlöm új rovar urainkat.
Chinese           23   1.9x  我,作为其中一员,欢迎我们新的昆虫霸主。
Romanian          24   2.0x  Eu, personal, îi întâmpin pe noii noștri stăpâni insecte.
Finnish           25   2.1x  Minä puolestani toivotan uudet hyönteisherramme tervetulleiksi.
Russian           28   2.3x  Я, например, приветствую наших новых повелителей-насекомых.
Ukrainian         28   2.3x  Я, наприклад, вітаю наших нових повелителів-комах.
Swahili           29   2.4x  Mimi, kwa upande wangu, nawakaribisha watawala wetu wapya wa wadudu.
Arabic            30   2.5x  أنا، من جهتي، أرحب بسادتنا الحشرات الجدد.
Korean            32   2.7x  저는 개인적으로 우리의 새로운 곤충 지배자들을 환영합니다.
Vietnamese        34   2.8x  Tôi, về phần mình, chào đón những chúa tể côn trùng mới của chúng ta.
Thai              44   3.7x  ผม ในส่วนตัว ยินดีต้อนรับเจ้านายแมลงคนใหม่ของเรา
Hebrew            46   3.8x  אני, מצדי, מקבל בברכה את אדוני החרקים החדשים שלנו.
Greek             54   4.5x  Εγώ, προσωπικά, καλωσορίζω τους νέους μας κυρίαρχους έντομα.
Hindi             56   4.7x  मैं, अपनी ओर से, हमारे नए कीट स्वामियों का स्वागत करता हूं।
Bengali           75   6.2x  আমি, আমার পক্ষ থেকে, আমাদের নতুন পোকা প্রভুদের স্বাগত জানাই।
Tamil             86   7.2x  நான், எனது பங்கிற்கு, எங்கள் புதிய பூச்சி அதிபதிகளை வரவேற்கிறேன்.

That outcome follows directly from history.

Byte Pair Encoding began as a compression algorithm in the 1990s. Engineers designed it to shrink arbitrary data by repeatedly merging the most frequent byte sequences. It did not emerge from linguistics. It emerged from pragmatism. When modern language models adopted BPE, they inherited that logic intact.
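To make the mechanics concrete, here is a minimal sketch of that merge loop, assuming a toy corpus of whitespace-separated words. Production tokenizers such as tiktoken operate on raw bytes over vastly larger corpora, but the core loop is the same idea.

```python
# Toy BPE vocabulary learning: repeatedly merge the most frequent adjacent pair.
# Illustrative sketch only; production tokenizers work on bytes at far larger scale.
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Each word starts as a tuple of characters, weighted by how often it occurs.
    word_freqs = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # the most frequent pair wins
        merges.append(best)
        # Rewrite every word with the winning pair fused into a single symbol.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_freqs[tuple(out)] += freq
        word_freqs = new_freqs
    return merges

# Frequent sequences ("th", "ng") earn merges early; rare ones never do.
corpus = "the thing the king sings the song the ring".split()
print(learn_bpe_merges(corpus, 6))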

Try this out as an interactive demo

Frequency determines value.

Substrings that appear often in the training data become single, cheap tokens. Substrings that appear less often remain fragmented into smaller pieces. English benefits because English dominates the corpus. Japanese fragments because the tokenizer encountered its patterns less often during training.

The tokenizer does not struggle with Japanese. It simply stores Japanese inefficiently.

That inefficiency shows up as cost.

Type "Hello, how are you?" in Tamil. Twenty-three tokens.

Languages the tokenizer saw less often get encoded less efficiently. And the difference isn't small.

Tamil speakers pay 4x what English speakers pay for the same information.
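You can reproduce counts like these yourself with OpenAI's open-source tiktoken library. Below is a quick sketch using the cl100k_base encoding (the one GPT-4 uses); exact counts can drift slightly between encoding versions, but the ranking holds.

```python
# Count tokens for the same sentence across languages using tiktoken (cl100k_base).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "I, for one, welcome our new insect overlords.",
    "Japanese": "私は新しい昆虫の支配者たちを歓迎します。",
    "Hindi": "मैं, अपनी ओर से, हमारे नए कीट स्वामियों का स्वागत करता हूं।",
    "Tamil": "நான், எனது பங்கிற்கு, எங்கள் புதிய பூச்சி அதிபதிகளை வரவேற்கிறேன்.",
}

baseline = len(enc.encode(samples["English"]))
for language, text in samples.items():
    n = len(enc.encode(text))
    print(f"{language:10s} {n:3d} tokens  ({n / baseline:.1f}x English)")
```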

Token counts drive pricing, context limits, and truncation behavior. When Japanese text expands into more tokens, fewer ideas fit into the same context window. Long prompts collapse sooner. Retrieval pipelines return less semantic content per request. None of this looks dramatic in isolation. Together, it reshapes what multilingual users can afford to do.

This behavior does not reflect a mistake. It reflects a design tradeoff. Engineers optimized tokenization for scale, speed, and statistical coverage, not for linguistic equity. A compression algorithm rewards what it sees most often. It always has.

In English corpora, a very small set of function words accounts for an outsized share of all tokens:

Depending on corpus and counting method, the top 50 to 100 words in English often cover 45-60% of all word occurrences in running text. Most of those words are function words.
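That concentration is easy to verify on any English text you have on hand. Here is a minimal sketch; the corpus.txt path and the top-100 cutoff are placeholders, not a canonical benchmark.

```python
# Measure how much of running text the k most frequent words cover.
import re
from collections import Counter

def top_k_coverage(text, k=100):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / len(words)

# "corpus.txt" is a placeholder for any large English text file.
with open("corpus.txt", encoding="utf-8") as f:
    share = top_k_coverage(f.read(), k=100)
print(f"Top-100 words cover {share:.0%} of all word occurrences")
```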

Language    Type            Content Words (Lexicon)   Function Words (Lexicon)   Content Words (Usage)   Function Words (Usage)
English     Analytic        ~99.7% (~170K words)      ~0.3% (~300 words)         ~45%                    ~55%
Chinese     Isolating       ~99.5% (~50K words)       ~0.5% (~250 words)         ~55%                    ~45%
Japanese    Agglutinative   ~99.8% (~50K words)       ~0.2% (~100 words)         ~60%                    ~40%
Russian     Fusional        ~99.5% (~150K words)      ~0.5% (~400 words)         ~65%                    ~35%
Arabic      Fusional        ~99.6% (~60K words)       ~0.4% (~200 words)         ~70%                    ~30%
Korean      Agglutinative   ~99.8% (~50K words)       ~0.2% (~80 words)          ~75%                    ~25%
Turkish     Agglutinative   ~99.9% (~100K words)      ~0.1% (~50 words)          ~80%                    ~20%
Hungarian   Agglutinative   ~99.9% (~80K words)       ~0.1% (~40 words)          ~82%                    ~18%
Finnish     Agglutinative   ~99.9% (~90K words)       ~0.1% (~40 words)          ~85%                    ~15%
Swahili     Agglutinative   ~99.9% (~50K words)       ~0.1% (~30 words)          ~88%                    ~12%
Inuktitut   Polysynthetic   ~99.95% (~10K roots)      ~0.05% (~20 words)         ~95%                    ~5%

English concentrates usage into a few function words.

Chart: Function Words in Actual Usage, sorted high to low, from English at 55% down to Inuktitut at 5% (the Function Words (Usage) column from the table above).

BPE does not care what a string means.

It only cares how often it repeats.

English repeats for two reasons that compound:

  1. Corpus dominance
    English appears far more often than any other language in LLM training data. That guarantees its surface forms get seen orders of magnitude more times.
  2. Internal repetition structure
    Within English, a small set of function words accounts for roughly half of all usage. Those words repeat constantly and with minimal variation.

Corpus dominance determines which language benefits.

Function-word concentration determines how much it benefits.

Neither alone is sufficient.

Linguists classify languages by how they package meaning into words. Some languages use many small, separate words. Others pack entire sentences into single, complex words. This structural difference has profound implications for how well BPE tokenization can compress text.

The four major types:

  1. Analytic (e.g., English)
    Meaning is carried by many short, separate words and by word order.
  2. Fusional (e.g., Russian, Arabic)
    Single endings fuse several grammatical meanings into one affix.
  3. Agglutinative (e.g., Turkish, Finnish, Korean)
    Words are built by chaining many distinct suffixes onto a root.
  4. Polysynthetic (e.g., Inuktitut)
    Entire sentences can be packed into single, very long words.

Why this matters for tokenization:

BPE learns to compress text by finding repeated byte sequences. Languages with many short, stable, frequently-repeated words give BPE lots of reusable patterns. Languages that encode meaning in long, productive word forms produce fewer exact repetitions and hit a compression ceiling that no amount of training data can overcome.

The charts below show five properties that BPE exploits for compression. Notice how the polygon shrinks from Analytic → Polysynthetic. The smaller the polygon, the harder it is for BPE to achieve efficient tokenization regardless of how much training data exists.

Corpus dominance lowers cost; morphology sets the floor.

Token cost is not a property of meaning. It emerges from repetition, exposure, and structure. Corpus dominance decides which language benefits. Morphology decides how far that benefit can go.

Try It Yourself: The Compression Ceiling Explorer

The charts above show static snapshots, but the real insight comes from seeing how these properties change (or don't) as training data grows.

The interactive demo below lets you experiment with two variables:

  1. Language type: Select Analytic (English), Agglutinative (Turkish), Fusional (Russian), or Polysynthetic (Inuktitut)
  2. Corpus dominance: Slide from 1% to 100% to simulate what happens as a language gets more representation in training data

Watch what happens as you drag the slider. For English, the polygon expands dramatically, because more data means more compression. But for Polysynthetic languages, something different happens: the polygon grows, then hits a wall. Even at 100% corpus dominance, it can't reach the compression levels that English achieves at 50%.

This is the ceiling in action. Three of the five axes (Word Boundary Clarity, Function Word Frequency, Surface Form Stability) are fixed because they're determined by grammar, not data. Only two axes (Byte Reuse Potential, Compression Achieved) respond to more training data.

Can you make Polysynthetic beat Analytic?

Not likely. English not only has corpus dominance; it also has the highest structural ceiling. Even in a hypothetical world where Inuktitut dominated the training corpus, it would still tokenize less efficiently than English does today.

It's not just about how much data exists. It's about what compression is possible.

English wins twice: once from corpus dominance (it has the most training data), and again from structural advantage (its grammar creates the most compressible patterns). Other languages can close the first gap with more data. They cannot close the second.

The ceiling is baked into the grammar itself.

The Compounding Costs

Direct Financial Cost

At GPT-4o input pricing ($2.50 per million tokens):

Language   Tokens / 1,000 words   Cost       Ratio
English    ~1,300                 $0.00325   1.0x
Spanish    ~1,800                 $0.00450   1.4x
Russian    ~3,000                 $0.00750   2.3x
Arabic     ~3,250                 $0.00813   2.5x
Tamil      ~9,400                 $0.02350   7.2x

For a company processing millions of customer queries in Tamil, the infrastructure cost is 7x that of an English-only competitor.
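The arithmetic behind that table is simple. Here is a quick sketch using the $2.50-per-million rate and the tokens-per-1,000-words estimates above; treat the figures as rough averages, not measurements of your workload.

```python
# Per-language cost of sending 1,000 words, at $2.50 per million input tokens.
PRICE_PER_MILLION_TOKENS = 2.50

tokens_per_1000_words = {
    "English": 1_300,
    "Spanish": 1_800,
    "Russian": 3_000,
    "Arabic": 3_250,
    "Tamil": 9_400,
}

english_cost = tokens_per_1000_words["English"] / 1_000_000 * PRICE_PER_MILLION_TOKENS
for language, tokens in tokens_per_1000_words.items():
    cost = tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    print(f"{language:8s} ${cost:.5f} per 1,000 words  ({cost / english_cost:.1f}x)")
```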

Context Window Tax

GPT-4's 128K context window sounds huge. But that limit is counted in tokens, not words.

If you're working in Tamil, that 128K window holds roughly 14K words of content.

Globe made of puzzle pieces with one hemisphere fragmented
The pieces are all there. Some just cost more.

In English, the same window holds ~100K words.

Same advertised limit. One-seventh the actual capacity.
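The same per-word ratios explain the capacity gap. A rough sketch, assuming a 128K-token window and the ~1.3 and ~9.4 tokens-per-word figures implied by the pricing table above:

```python
# Approximate how many words of content fit in a 128K-token context window.
CONTEXT_TOKENS = 128_000

tokens_per_word = {
    "English": 1.3,  # ~1,300 tokens per 1,000 words
    "Tamil": 9.4,    # ~9,400 tokens per 1,000 words
}

for language, ratio in tokens_per_word.items():
    words = CONTEXT_TOKENS / ratio
    print(f"{language}: ~{words:,.0f} words fit in the advertised 128K window")
```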

Quality Degradation

Fragmented tokens don't just cost more. They may perform worse.

When a word is split into pieces, the model must:

  1. Recognize the pieces as belonging together
  2. Compose meaning across token boundaries
  3. Handle the increased sequence length

Research suggests models perform better on languages with more efficient tokenization. The tokenizer isn't neutral infrastructure. It's a performance bottleneck.

Who Pays Most?

Winners (1.0-2.0x):
English, followed by the European languages near the top of the table (German, Dutch, French, Spanish, Italian, Polish), plus Turkish, Japanese, and Chinese.

Moderate tax (2-3x):
Russian, Ukrainian, Swahili, Arabic, Korean, and Vietnamese.

Heavy tax (3-5x+):
Thai, Hebrew, Greek, Hindi, Bengali, and Tamil, which pays 7.2x.

GPT-4o doubled the tokenizer vocabulary (~200K tokens vs. ~100K). That helps somewhat: more vocabulary entries can be devoted to non-English text.

Some models train their tokenizers on more balanced multilingual corpora. These narrow the gap but don't eliminate it.

Language-Specific Models

For high-stakes applications, dedicated models exist. These optimize tokenization for their target language but sacrifice English performance.


References

1. Petrov, A., et al. (2023). "Language Model Tokenizers Introduce Unfairness Between Languages." arXiv.

2. Ahia, O., et al. (2023). "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models." EMNLP.

3. Rust, P., et al. (2021). "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models." ACL.

4. OpenAI. (2023). tiktoken. GitHub.

5. Conneau, A., et al. (2020). "Unsupervised Cross-lingual Representation Learning at Scale." ACL.