by shagie
1161 days ago
Including those alphabets as letters or single glyphs would still leave ドイツ at 3 tokens, whereas "Germany" is one token ("germany" is two tokens: [ger][many]). Tossing ドイツ into the tokenizer confirms that it is 3 tokens. Consider also the question "is it useful to tokenize just hiragana or katakana and not all of the kanji characters?" The glyph-by-glyph approach to tokenizing non-English text that you are describing is already present, and because it is glyph by glyph, the text gets expanded out and consumes more tokens. Korean gets rather interesting because 독일 is not one character but several: multiple sounds are combined into one glyph, and each glyph is one syllable. That word is 'dog-il' according to Google Translate. In the first glyph, ㄷ is 'd', ㅗ is 'o', and ㄱ is a trailing 'g'. In the second glyph, ㅣ is 'i' and ㄹ is a trailing 'l'. Likewise, its GPT tokenization is 5 tokens.
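If you want to poke at the glyph structure yourself, here is a rough stdlib-only Python sketch. It does not run the GPT tokenizer, so it says nothing about token counts; it only shows how many code points and UTF-8 bytes each word is made of, and splits Hangul syllables into their component sounds using the standard Unicode syllable arithmetic. The function names `decompose_hangul` and `utf8_len` are made up for this example.

```python
# Compatibility jamo, listed in the order used by the Unicode Hangul
# syllable composition algorithm (19 leads, 21 vowels, 27 tails + "none").
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ",
         "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ",
         "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

HANGUL_BASE = 0xAC00      # first precomposed syllable, 가
HANGUL_COUNT = 11172      # 19 * 21 * 28 precomposed syllables


def decompose_hangul(text):
    """Return one list of jamo per character; non-Hangul passes through as-is."""
    result = []
    for ch in text:
        index = ord(ch) - HANGUL_BASE
        if 0 <= index < HANGUL_COUNT:
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            jamo = [LEADS[lead], VOWELS[vowel]]
            if tail:  # tail index 0 means "no trailing consonant"
                jamo.append(TAILS[tail])
            result.append(jamo)
        else:
            result.append([ch])
    return result


def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    for word in ("Germany", "ドイツ", "독일"):
        print(word, len(word), "code points,", utf8_len(word), "UTF-8 bytes")
    print(decompose_hangul("독일"))
```

Note that the decomposition of the second syllable also reports an initial ㅇ. In that position it is a silent placeholder, which is why it does not show up in the sound-by-sound breakdown above.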