by mcswell
1161 days ago
I usually use the term "tokenization" to refer to breaking a text into "words" (tokens), although in the examples shown in the article, the tokenizer seems to split the Latin-script languages into something like morphemes. This has nothing to do with the UTF-8 encoding: Hindi would have the same number of tokens whether you encode it with UTF-8 (where each character is 3 bytes) or ISCII (where each character is 1 byte). But when it comes to Chinese... something weird is going on.
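To make the byte counts concrete, here is a quick Python sketch. The sample words and the helper name `utf8_bytes_per_char` are my own picks for illustration, not anything from the article. Python has no built-in ISCII codec, so only the UTF-8 side is shown.

```python
def utf8_bytes_per_char(text: str) -> list[int]:
    """Return the UTF-8 byte length of each code point in text."""
    return [len(ch.encode("utf-8")) for ch in text]

# Illustrative samples, written as escapes so the code points are unambiguous.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # Devanagari "namaste", 6 code points
chinese = "\u4f60\u597d"                        # "ni hao", 2 code points

for label, sample in (("Hindi", hindi), ("Chinese", chinese)):
    sizes = utf8_bytes_per_char(sample)
    print(label, "code points:", len(sample), "UTF-8 bytes:", sum(sizes))
```

A tokenizer that works on characters sees the same sequence of code points no matter how they were stored on disk. Only a tokenizer that works directly on bytes would notice the difference between encodings.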
Tokens for non-English languages that are groups of characters just suggest that common groups of 2-3 characters in the training set became tokens, which feels unsurprising. The fallback behavior would be 1 UTF-8 byte = 1 token.
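A toy sketch of that fallback, under loud assumptions: the vocabulary below is made up, and greedy longest-match stands in for the learned merges of a real BPE tokenizer, so this is not how any particular model does it. It only shows how a character group that is in the vocabulary costs one token, while an unseen character costs one token per UTF-8 byte.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match against a hypothetical vocab, with byte fallback."""
    max_len = max((len(v) for v in vocab), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # Nothing matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens

# Hypothetical vocabulary: one common Chinese pair made it in, single characters did not.
vocab = {"\u4f60\u597d", "the"}

print(tokenize("\u4f60\u597d", vocab))  # the pair is a single token
print(tokenize("\u597d", vocab))        # the lone character falls back to bytes
```

With this vocabulary the two-character pair is one token, but the second character on its own becomes three byte tokens, which is the kind of inflation the fallback would produce for rare characters.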