Hacker News
by kevingadd 1162 days ago
If you familiarize yourself with ideographic or ideographic-adjacent languages like Japanese or Chinese, you will probably notice that they are way more efficient than English. Yet those languages pay a tokenization tax too (thanks in no small part to the decisions of the largely Western Unicode committees to favor Western character sets: the UTF-8 encoding favors ASCII tremendously).

I usually use the term "tokenization" to refer to breaking a text into "words" (tokens), although in the examples shown in the article, for the Latin-script languages, it seems to be tokenizing into something like morphemes. This has nothing to do with the Unicode UTF-8 encoding system; Hindi would have the same number of tokens whether you encode it with UTF-8 (where each character is 3 bytes) or ISCII (where each character is 1 byte).
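To make that concrete, here is a rough Python sketch. It assumes "tokenization" means plain whitespace splitting (my simplification, not what the article's tokenizer does), and it uses UTF-16 as the second encoding because Python's standard library has no ISCII codec. The sample phrase and the `word_tokens` helper are made up for illustration.

```python
# Sample Hindi phrase ("namaste duniya"), written with escapes so the exact
# code points are unambiguous: 12 Devanagari code points plus one space.
text = "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"

def word_tokens(s: str) -> list[str]:
    """Hypothetical word-level tokenizer: split on whitespace."""
    return s.split()

utf8_bytes = text.encode("utf-8")       # 3 bytes per Devanagari code point
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per code point here

# The byte counts differ a lot between encodings...
print(len(text), len(utf8_bytes), len(utf16_bytes))  # 13 37 26

# ...but a tokenizer that works on decoded characters sees the same text,
# so the word-level token count does not depend on the encoding.
tokens_from_utf8 = word_tokens(utf8_bytes.decode("utf-8"))
tokens_from_utf16 = word_tokens(utf16_bytes.decode("utf-16-le"))
print(len(tokens_from_utf8), len(tokens_from_utf16))  # 2 2
```

The encoding only starts to matter once the tokenizer operates on raw bytes instead of decoded characters.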

But when it comes to Chinese...something weird is going on.

The behavior on Chinese is what makes me believe it's tokenizing on something like UTF-8 (hopefully normalized). I'm not sure how else you would get that behavior.

Tokens for non-English languages that are groups of characters just suggest that common groups of 2-3 characters from the training set became tokens, which feels unsurprising. The fallback behavior would be 1 UTF-8 byte = 1 token.
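A toy sketch of that fallback idea, in Python. This is not OpenAI's actual algorithm: it is a greedy longest-match over a hand-written vocabulary of byte strings, and the `tokenize` function and `vocab` contents are invented for illustration. Anything not covered by a multi-byte vocabulary entry falls back to one token per UTF-8 byte.

```python
def tokenize(text: str, vocab: set[bytes]) -> list[bytes]:
    """Greedy longest-match over UTF-8 bytes with a single-byte fallback."""
    data = text.encode("utf-8")
    max_len = max((len(v) for v in vocab), default=1)
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest multi-byte vocabulary entry starting at position i.
        for n in range(min(max_len, len(data) - i), 1, -1):
            piece = data[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # Nothing matched: emit one raw byte as its own token.
            tokens.append(data[i:i + 1])
            i += 1
    return tokens

# Hypothetical vocabulary that only learned English pieces.
vocab = {b"hello", b" world"}

english = tokenize("hello world", vocab)
chinese = tokenize("\u4f60\u597d", vocab)  # two Chinese characters, 3 bytes each

print(len(english))  # 2
print(len(chinese))  # 6
```

Two Chinese characters cost six tokens here because each is three UTF-8 bytes and nothing in the vocabulary covers them. Add their byte sequence to `vocab` and the same text drops to a single token.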

That might not be true. OpenAI does set a limit on the total number of tokens, and since I'm pretty sure they trained the model and the tokenizer on mostly English text, I assume there's a somewhat proportional bias toward English based on the input dataset to those models.
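Here is a small sketch of how that bias could arise, assuming the limit means a fixed budget of learned tokens and assuming a byte-level BPE-style trainer. The corpus, the merge budget, and the function names are all made up; this is a toy, not OpenAI's training code.

```python
from collections import Counter

def apply_merge(seq, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    seqs = [[bytes([b]) for b in s.encode("utf-8")] for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges

def encode(text, merges):
    """Start from raw UTF-8 bytes and apply the learned merges in order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

# Made-up corpus: overwhelmingly English, with a single Chinese sample.
corpus = ["the cat sat on the mat"] * 50 + ["\u4f60\u597d"]
merges = train_bpe(corpus, num_merges=5)  # tiny token budget

english_tokens = encode("the cat", merges)
chinese_tokens = encode("\u4f60\u597d", merges)
print(len("the cat".encode("utf-8")), len(english_tokens))
print(len("\u4f60\u597d".encode("utf-8")), len(chinese_tokens))
```

With a budget this small, every merge gets spent on English byte pairs because they dominate the counts, so the English string compresses while the Chinese string stays at one token per byte.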