Hacker News
by 29athrowaway 1168 days ago
It is not that tokenization is optimized for English, but rather the other way around perhaps.

Take "lámpara" or "pantalones" in Spanish for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively. And they have done this with many words.

Translate text into Spanish and you will see text gets longer and there is more meaning encoded into words.

"La mesa" refers to a female table, although tables are not lifeforms and have no sex.

To me some languages impose a communication tax. It is taboo because people conflate language and culture and such.

7 comments

It's funny that you're calling English "effective" because it has shorter words, even though word length has nothing to do with tokenization effectiveness -- if a long word is frequent enough, it becomes a single token. That's the point of doing tokenization instead of feeding raw bytes into the model.
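
To illustrate the point about frequent words collapsing into one token, here is a minimal sketch of byte-pair-encoding-style training on a made-up corpus. It is a toy that works on characters and whitespace-split words, not the tokenizer any particular model uses; `train_bpe` and the corpus are hypothetical names chosen for this example.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word is a tuple of symbols (initially characters), weighted by frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across all words, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the best pair with one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges, words

# Hypothetical corpus: a long word that is very frequent, a short word that is rare.
corpus = "pantalones " * 50 + "lamp"
merges, words = train_bpe(corpus, num_merges=9)
for symbols, freq in words.items():
    print(freq, symbols)
```

With this corpus, nine merges are enough to collapse the ten-letter "pantalones" into a single symbol, while the rare "lamp" is still split into four. Frequency in the training data, not word length, decides what becomes a token.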

BTW, English might have shorter words than many languages, but the sentences get wordier. For example, English "die" is shorter than Czech "umřít", but the sentence "We are going to die." is much longer than "Umřeme." in Czech.

"la mesa" isn't a female table, it's just a table. If you want to specify that the table is female (in reality) then you might say "mesa hembra". The fact that "mesa" is _grammatically_ feminine is a red herring. It's a rule of the language that occasionally corresponds to nature, but that's in a very limited minority of cases. You can think of grammatical gender like an optional redundant bit (against, say mishearing) when giving some information, but since there's no other way to talk about a table it doesn't give any more information than "the table" when written down.
Yet, "el mesa" is wrong. You have to memorize it.
Also wrong are "a hour" and "an cats". Sometimes Spanish uses one word ("hablo") where English needs two ("I speak").

Comparative analysis of language isn't taboo [1]. It's just vastly more complicated than you make out, and the specific examples you chose aren't representative enough to support any point.

You're likely getting downvoted for misunderstanding basic socio-linguistic concepts in a way that belies the confidence of your argument: conflating biological and grammatical gender, implying that English was created by a committee of clever language designers, and focusing on letters and words over concepts and comprehension.

[1] https://www.science.org/doi/10.1126/sciadv.aaw2594

Which you can infer from the word characters instead of memorization.
You use "an" when the word starts with a vowel _sound_, regardless of spelling, and pronunciation has to be memorized in English. "The class lasts an hour and he's getting an MBA" is the correct usage even though "hour" and "MBA" both start with consonant letters.
One wonders whether highly agglutinative languages, then, might have even better performance than English in the tokenizer since they can pack much more meaning into a single word.

The linked article shows one such language, Malayalam, costing 15.7 times more. Try again.

If you familiarize yourself with ideographic/ideographic-adjacent languages like Japanese or Chinese you will probably notice that they are way more efficient than English. Yet those languages pay a tokenization tax too (thanks in no small part to the decisions of the largely western Unicode committees to favor western character sets: the UTF-8 encoding favors ASCII tremendously).
I usually use the term "tokenization" to refer to breaking a text into "words" (tokens), although in the examples shown in the article, for the Latin script languages it seems to be doing tokenization into something like morphemes. This has nothing to do with the Unicode UTF-8 encoding system; Hindi would have the same number of tokens if you encode it with UTF-8 (where each character is 3 bytes) or ISCII (where each character is 1 byte).
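
The byte counts being argued about here can be checked directly. This is a small sketch using only Python's built-in `str.encode`; the sample words are illustrative choices, and `utf8_profile` is a hypothetical helper name. Whether byte length affects token count depends on whether the tokenizer operates on bytes, which this snippet does not settle.

```python
def utf8_profile(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

# Illustrative sample words, one per script.
samples = {
    "English": "table",
    "Spanish": "lámpara",
    "Hindi": "मेज",
    "Chinese": "桌子",
}

for name, word in samples.items():
    chars, size = utf8_profile(word)
    print(f"{name:8} {word!r}: {chars} code points, {size} UTF-8 bytes")
```

ASCII letters take one byte each, accented Latin letters two, and Devanagari and common Chinese characters three.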

But when it comes to Chinese...something weird is going on.

The behavior on Chinese is what makes me believe it's tokenizing on something like UTF-8 (hopefully normalized). I'm not sure how else you would get that behavior.

Tokens for non-English languages that are groups of characters just suggest that common groups of 2-3 characters from the training set became tokens, which feels unsurprising. The fallback behavior would be 1 UTF-8 byte = 1 token.
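
A rough sketch of that fallback behavior, assuming a greedy longest-match tokenizer over a tiny hand-written vocabulary. The `tokenize` function, the vocabulary, and the `<0xNN>` token spelling are all made up for illustration; this is a guess at the general shape, not how OpenAI's tokenizer is implemented.

```python
def tokenize(text, vocab):
    """Greedy longest match against vocab; anything unmatched becomes one token per UTF-8 byte."""
    max_len = max((len(v) for v in vocab), default=0)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate substring first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry matched: emit each UTF-8 byte of this character as its own token.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

# Hypothetical vocabulary learned from mostly English text.
vocab = {"the", "table", " ", "t", "h", "e"}

print(tokenize("the table", vocab))
print(tokenize("桌", vocab))
```

With this vocabulary, "the table" comes out as three tokens, while the single character "桌" also costs three tokens, one per UTF-8 byte, because nothing in the vocabulary covers it.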

That might not be true. OpenAI do set a limit of the total number of tokens, and since I'm pretty sure they trained the model and the tokenizer on mostly English text, I assume there's a somewhat proportional bias toward English based on the input dataset to those models.
Different languages have different levels of conciseness, of course, but I highly doubt that Spanish is anywhere close to 15x less concise than English.
I can believe that it's 1.5x less concise and that there's 10x less training data in Spanish compared to English.
Eh… “la mesa” is “the table”, English wins. Even in context, Spanish conjugation rules allow you to elide pronouns in many cases that would be confusing in English.

The reason Spanish might encode longer is that the tokenization scheme compacts tokens based on popularity in the training data, and most training data was English. No more, no less.

Communication rates are very similar across languages: https://www.science.org/doi/10.1126/sciadv.aaw2594

See also (great read): https://pubmed.ncbi.nlm.nih.gov/31006626/

wrt your Spanish example: grammatical gender adds information redundancy to make it easier to process spoken language (e.g. helps with reference resolution). This redundancy enables Spanish speakers to speak at a relatively fast rate without incurring perception errors. English has fewer words but a slower speech rate. It's an optimization problem.

The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech as a language evolutionary constraint has implications for learnability.

tl;dr there is no communication tax, languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently