Raphael Thys
Futurist · Keynote Speaker · AI Coach
Not All Languages Are Equal for AI — And It Changes How You Should Work

The Hidden Cost of Language in AI - and what you can do about it

Here is a question worth sitting with: if you type a prompt in French to an AI assistant, are you getting the same quality of result as a colleague who typed the same thing in English?

The answer, structurally speaking, is no.

Not because the AI is less capable — but because of something baked deep into how these systems process language: the token.

This article explains what a token is, why language matters more than most users realise, and — most importantly — what practical habits you can adopt to get the best out of AI regardless of which language you work in.

What is a Token?

AI language models do not read text the way you do. They do not process whole words or sentences. Instead, they break all text into small units called tokens — roughly speaking, a token is a chunk of about 4 characters in English, which often corresponds to a partial word, a full short word, or a common word ending.

The word "Apple" in English? That is 1 token. The French word "Élégance"? That fragments into 2 tokens: "Élég" + "ance." The Hindi greeting "Namaste" in Devanagari script? Up to 6–10 tokens, because the tokenizer must decompose it character by character.

(These counts are illustrative — exact tokenisation varies by model and tokeniser version.)
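The ~4-characters-per-token rule of thumb for English can be turned into a quick back-of-envelope estimator. This is only a heuristic for rough budgeting — real tokenisers produce different counts, and the ratio breaks down entirely for non-English text:

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: ~4 characters per token."""
    return max(1, math.ceil(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```

For anything precise (billing, context budgeting), use the actual tokeniser your model ships with rather than this approximation.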

This is not random. Tokenisers are trained predominantly on English data — for instance, Meta's Llama 3 used 95% English and code in its training set. The result: a rich, efficient vocabulary for English, and a less efficient one for most other languages.

Newer tokenisers (such as GPT-4o's o200k_base and Mistral's Tekken) are beginning to address this imbalance — more on that later.

The Token Tax: A Real Efficiency Gap

Researchers who have measured token consumption across languages have a vivid term for this: the "token tax." Non-English languages carry a structural inflation premium — they cost more tokens to express the same semantic content.

Think of it as an exchange rate where English is always the reserve currency.

Sources: Petrov et al., “Language Model Tokenizers Introduce Unfairness Between Languages” (NeurIPS 2023); Ahia et al., “Do All Languages Cost the Same?” (2023). Figures are approximate and vary across models and tokenizer versions. Note: “Tokens/Word” is an imperfect metric for Chinese, which lacks clear word boundaries; the commonly used measure for CJK languages is tokens per character.

Why doesn't Chinese escape the tax?

Chinese is a highly dense language — a single character can convey an entire concept. Yet it still pays a premium.

The reason is what we might call the UTF-8 encoding paradox: an ASCII Latin character occupies 1 byte in UTF-8, while a typical Chinese character occupies 3 bytes.

Tokenisers use byte-pair encoding trained on text corpora, so both byte representation and frequency in training data determine token efficiency.
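To make the byte-pair encoding idea concrete, here is a deliberately tiny sketch: BPE starts from individual symbols and repeatedly applies merges learned from the training corpus. The merge list below is hypothetical — invented to show why a word common in the training data collapses to one token while an unfamiliar word stays fragmented:

```python
def bpe_tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Toy BPE: start from single characters, apply learned merges in order."""
    tokens = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(tokens):
            # Merge an adjacent (a, b) pair into one token.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical merges a tokeniser trained mostly on English might learn:
merges = [("p", "p"), ("A", "pp"), ("App", "l"), ("Appl", "e")]
print(bpe_tokenize("Apple", merges))     # ['Apple'] — one token
print(bpe_tokenize("Élégance", merges))  # no merges apply — falls back to characters
```

Real tokenisers work on bytes rather than characters and carry hundreds of thousands of merges, but the asymmetry is the same: frequent-in-training strings get merged, everything else fragments.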

A very common Chinese character may be a single token; a rarer one may fragment into two or three. In practice, high semantic density is significantly offset — though not entirely cancelled out — by heavier encoding and lower frequency in training data.
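You can see the byte-level side of this directly in Python, since `str.encode("utf-8")` gives the raw byte representation a byte-level tokeniser starts from:

```python
# Bytes per character in UTF-8 — one of the two inputs to tokenisation cost.
for ch, label in [("A", "ASCII Latin"), ("é", "accented Latin"),
                  ("中", "Chinese"), ("न", "Devanagari")]:
    print(f"{ch!r} ({label}): {len(ch.encode('utf-8'))} byte(s)")
```

A Chinese or Devanagari character therefore enters the tokeniser as a 3-byte sequence, which only collapses to a single token if that sequence was frequent enough in training.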

Why Does This Actually Matter? The Context Window Problem

"Tokens" might sound abstract — until you understand that every AI model has a context window: a hard limit on how many tokens it can hold in "memory" at one time. This is not just the prompt you type — it includes the model's response, any documents you share, and any conversation history.

A 128,000-token context window — a common size for current models — allows an English user to load the equivalent of a full novel. The same window, filled with a document in Arabic or Hindi, might hold only a third to a fifth as much content — and for some very low-resource languages, the ratio can be even more extreme (research has documented differences of up to 15×). Same memory. Significantly different capacity.
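The arithmetic behind that capacity gap is simple. The fertility values below (tokens per word) are illustrative, not measurements for any specific model:

```python
def effective_words(context_tokens: int, tokens_per_word: float) -> int:
    """How many words of a given language fit in the context window."""
    return int(context_tokens / tokens_per_word)

WINDOW = 128_000
print(effective_words(WINDOW, 1.3))  # English-like fertility: ~98k words
print(effective_words(WINDOW, 4.0))  # heavily fragmented script: 32k words
```

Same window, roughly a third of the content — before the model has produced a single token of output.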

This has a concrete downstream effect on reasoning quality. Complex tasks — legal analysis, document summarisation, multi-step decision support — rely on the model holding enough context to reason coherently. In high-inflation languages, the context fills up faster, and the model is forced to truncate, compress, or shallow-reason before completing the task.

You are, in effect, paying more compute for less intelligence.

📌

A concrete example

Imagine you upload a 20-page policy briefing in French and ask for a detailed analysis. Compared to the same document in English, the French version will consume more of the model's available context just to be read — leaving less room for the reasoning layer to do its work. The result may be a shallower, more generic analysis.
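A rough budget for this scenario can be sketched numerically. The page length, words per page, fertility values, and reserved output size below are all assumed for illustration:

```python
WINDOW = 128_000                   # context window, in tokens
PAGES, WORDS_PER_PAGE = 20, 500    # assumed briefing size
RESERVED_OUTPUT = 4_000            # tokens kept free for the model's answer

def reasoning_headroom(tokens_per_word: float) -> int:
    """Context left for reasoning after the document is loaded."""
    document = int(PAGES * WORDS_PER_PAGE * tokens_per_word)
    return WINDOW - document - RESERVED_OUTPUT

print(reasoning_headroom(1.3))  # English-like fertility
print(reasoning_headroom(2.0))  # inflated fertility (illustrative)
```

The absolute numbers matter less than the direction: every extra token spent reading the document is a token unavailable for reasoning about it.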

What You Can Do: Practical Strategies

Understanding the token tax is not an argument for abandoning your working language. It is an argument for being deliberate about language choice as a prompt engineering decision.

Here are four concrete strategies:

1. Prompt in English for complex reasoning tasks

For tasks that require structured, multi-step reasoning — drafting, analysis, classification, code generation — prompting in English is generally the most token-efficient approach. Research (including EPFL's 2024 study on Llama-2) suggests that models often reason through an English-like internal representation, even when prompted in other languages. This advantage is most pronounced for complex, multi-step tasks — though the gap is narrowing with newer models, especially for high-resource European languages like French.

You can always ask the model to respond in French, or whatever language your output needs to be in:

👉

Example prompt:

"Analyze this document and identify the three main risks. Respond in French."

This gives you English-quality reasoning with a French-language output.
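If you build prompts programmatically, this pattern reduces to a small template helper — a minimal sketch, with the function name and wording as assumptions, not a standard API:

```python
def cross_lingual_prompt(task_en: str, output_lang: str) -> str:
    """Phrase the instruction in English, request output in the target language."""
    return f"{task_en}\n\nRespond in {output_lang}."

print(cross_lingual_prompt(
    "Analyze this document and identify the three main risks.", "French"))
```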

2. Keep prompts concise in non-English languages

Every extra word in a high-token language costs more context space.

Be especially concise when prompting in German, Arabic, or other morphologically complex languages:
  • Remove courtesy phrases, redundant context, and long preambles.
  • Get to the instruction directly.

3. Summarise long source documents before deep analysis

If you need to work with a long document in a non-English language, consider asking the AI first to produce a concise summary — then using that summary as the basis for deeper analysis. This two-step approach dramatically reduces token consumption before the reasoning task begins.
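The two-step pattern can be sketched as a small pipeline. The `llm` parameter here is a stand-in for whatever completion call you actually use — the stub below only exists so the example runs:

```python
def two_step_analysis(document: str, llm) -> str:
    """Step 1: compress the document. Step 2: reason over the short summary."""
    summary = llm("Summarise the key points of this document concisely:\n\n" + document)
    return llm("Based on this summary, identify the three main risks:\n\n" + summary)

# Stub model for illustration only — echoes the last line of the prompt.
stub = lambda prompt: prompt.splitlines()[-1]
print(two_step_analysis("Page one.\nPage two.", stub))
```

The trade-off is real: the summary step can lose detail, so it suits tasks where breadth matters more than verbatim precision.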

4. Use English for chain-of-thought prompting

Techniques like "think step by step" or "first list the constraints, then propose solutions" are highly token-intensive — they work by asking the model to reason out loud.

These techniques are most effective in English, where they consume the least context budget. Apply them in English, even when your input or output document is in another language.

European Models Are Closing the Gap

The token tax is not a permanent feature of all AI models equally. European-origin models are actively addressing this gap by training their tokenisers on language-specific corpora — meaning a French-origin model can learn to recognise whole French words as single tokens, reducing fragmentation and bringing costs much closer to English parity.

This is one concrete reason why European-built AI infrastructure matters beyond political considerations: language efficiency is an infrastructure decision, and models trained on European language data are structurally better suited to European language workloads. The GenAI Hub's use of European open-source models is directly relevant here.

The Market Correction: European Models Fighting Back

🇪🇺

EuroLLM — Built from Scratch for All 24 EU Languages

Origin: EU-funded research consortium including Unbabel, University of Edinburgh, Instituto Superior Técnico (Lisbon), Université Paris-Saclay, and others. Fully open source, Apache 2.0.

Launched in September 2024 and trained on EuroHPC infrastructure (the MareNostrum 5 supercomputer in Barcelona), EuroLLM is the most ambitious European open-source LLM project to date. Its explicit goal: build models that natively support all 24 official EU languages, not as an afterthought, but as a primary design constraint.

The EuroLLM tokeniser uses a byte-fallback BPE SentencePiece vocabulary of 128,000 subwords, trained across the full multilingual pretraining corpus. The result is a median fertility of 1.2–1.4 tokens per word across EU languages — comparable to what English achieves on US-built models.

The project has released three generations of models: EuroLLM-1.7B, EuroLLM-9B, and EuroLLM-22B (the latest, released in 2025), with EuroLLM-22B ranking as the best fully open European-made LLM across multilingual benchmarks at the time of release.

The important nuance: EuroLLM is a research-grade open model — it is not yet in the same capability tier as commercial models like GPT-4o or Claude for general reasoning tasks. Its strength is multilingual coverage and tokenisation equity across EU languages. Think of it as the infrastructure layer maturing, not a finished product.

Sources: EuroHPC JU (eurohpc-ju.europa.eu, 2024); HuggingFace EuroLLM-22B blog (2025); Martins et al., EuroLLM: Multilingual Language Models for Europe, 2024.

🇪🇺 Mistral AI — Closing the Gap for European Languages

Origin: French AI company, founded 2023. Open-weight models, Apache 2.0 licence.

Early Mistral models used a 32,000-token SentencePiece vocabulary — moderate but not exceptional for multilingual use. With Mistral NeMo (2024), Mistral introduced a new tokeniser called Tekken, trained on over 100 languages with a vocabulary 10× larger than its predecessor.

The results: Tekken is approximately 30% more efficient at compressing French, German, Spanish, Italian, Chinese and Russian text, and 2× to 3× more efficient for Korean and Arabic, compared to the previous Mistral tokeniser. Against the Llama 3 tokeniser, Tekken outperforms on approximately 85% of languages tested.

What this means in practice: the gap between English and French tokenisation cost, which stood at roughly +50% on standard US models, narrows substantially on recent Mistral models — potentially below +10% cost overhead for French. This makes Mistral models structurally advantageous for European-language workloads, independent of any other quality considerations.

Source: Mistral AI (mistral.ai/news/mistral-nemo), 2024.

💡

Key Takeaways

  • AI models process text as tokens, not words — and English is structurally cheaper to process than other languages.
  • Non-English languages carry a "token tax" — consuming more context window for the same content, which can reduce reasoning depth.
  • The UTF-8 encoding of non-Latin scripts means even information-dense languages like Chinese are not exempt.
  • Practical fix: prompt in English for complex reasoning tasks, even when you need output in another language.
  • European-origin models are beginning to close this gap for European languages through language-specific tokeniser training.
  • Try it yourself: take a prompt you usually write in your language, rewrite the instruction in English with "Respond in [language]" at the end, and compare the results.