

by shagie 1161 days ago

Adding those scripts to the vocabulary as individual letters or single glyphs would still leave ドイツ at 3 tokens, whereas "Germany" is one token ("germany" is two tokens: [ger][many]).

And tossing ドイツ into the tokenizer shows that it is 3 tokens.

Consider also the question "is it useful to just tokenize hiragana or katakana and not all of the kanji characters?"

The glyph-by-glyph approach to tokenizing non-English text is already present in the way you are describing it, and because it is glyph by glyph, the text gets expanded out and consumes more tokens.
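To make the expansion concrete, here is a rough sketch that compares code points with UTF-8 bytes for the words in this thread. It assumes a byte-level BPE tokenizer (as GPT-2 style tokenizers are), where every token covers at least one byte, so the byte count is only an upper bound on the token count, not the count itself. The helper name `utf8_profile` is made up for illustration.

```python
# Compare code point counts with UTF-8 byte counts.
# Assumption: with a byte-level BPE tokenizer, a string can never need
# more tokens than it has UTF-8 bytes, so nbytes is an upper bound only.
words = ["Germany", "ドイツ", "독일"]


def utf8_profile(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


for w in words:
    chars, nbytes = utf8_profile(w)
    print(f"{w}: {chars} code points, {nbytes} UTF-8 bytes")
```

Each katakana or Hangul character here takes three bytes in UTF-8, while each ASCII letter takes one, which is why a tokenizer with few merges for those scripts ends up spending several tokens on a short word.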

Korean gets rather interesting because 독일 is not built from single-sound characters: multiple sounds are combined into one glyph, and each glyph is one syllable. That word is 'dog-il' according to Google Translate. In the first glyph, ㄷ is 'd', ㅗ is 'o', and ㄱ is a trailing 'g'. In the second glyph, ㅇ is a silent placeholder, ㅣ is 'i', and ㄹ is a trailing 'l'.
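The way a syllable glyph is assembled from letters can be checked with Python's standard `unicodedata` module. This is just a sketch: Unicode NFD normalization splits each precomposed Hangul syllable into its conjoining jamo (leading consonant, vowel, optional trailing consonant). The function name `decompose` is made up here, and this says nothing about how any GPT tokenizer actually splits the word.

```python
import unicodedata


def decompose(text: str) -> list[str]:
    """Split precomposed Hangul syllables into conjoining jamo via NFD."""
    return list(unicodedata.normalize("NFD", text))


# Print each jamo's code point and its official Unicode name.
for ch in decompose("독일"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The two glyphs come apart into six jamo, three per syllable, including the silent leading ㅇ in the second one.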

Likewise, its GPT tokenization is 5 tokens.