by mgaunard 1170 days ago
So for Latin-script languages, they tokenize per word, and somehow for Asian languages, it's tokenizing per radical.

Of course you'd end up with a lot more tokens. Just tokenize by word regardless of language.

3 comments

"word" isn't a useful concept in a lot of languages. Words are obvious in English because English is analytic: https://en.wikipedia.org/wiki/Analytic_language

But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English, e.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts, because 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
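To make the token-sharing point concrete, here is a minimal sketch of a greedy longest-match splitter. The vocabulary `TOY_VOCAB` and the function `segment` are made up for illustration; real tokenizers (BPE and friends) learn their pieces from data and do not work exactly like this.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    word = word.lower()
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical four-entry vocabulary, chosen by hand for this example.
TOY_VOCAB = {"schaden", "freude", "vor", "lebens"}

for w in ("Schadenfreude", "Vorfreude", "Lebensfreude"):
    print(w, segment(w, TOY_VOCAB))
```

All three compounds reuse the same 'freude' piece, so four vocabulary entries cover three "words" that would otherwise each need their own entry.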

Of course most Indo-European languages have declensions or at least conjugation. That includes English, even if it is heavily simplified there.

CJK languages do not really have that, they don't even have conjugation. They have simple suffixes at best to mark a verb as being interrogative.

Words do exist in all CJK languages and are meaningful. Tokenizing by Hangul radicals doesn't really make sense if you're not going to tokenize by decomposed letter in Romance languages too (e.g. hôtel would be h, o, ^, t, e, l).
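The decomposition being compared here can be seen with Unicode normalization. This is only a sketch using Python's standard `unicodedata` module; the helper name `decompose` is made up, and it says nothing about what any particular tokenizer actually does.

```python
import unicodedata


def decompose(text):
    """Return the code points of `text` after canonical decomposition (NFD)."""
    return list(unicodedata.normalize("NFD", text))


# "h\u00f4tel" is "hôtel" written with the precomposed o-circumflex.
for sample in ("h\u00f4tel", "\ud55c\uae00"):  # second sample is the word Hangul
    parts = decompose(sample)
    print(sample, len(parts), [unicodedata.name(ch) for ch in parts])
```

Under NFD, "hôtel" becomes six code points (the circumflex splits off as a combining mark), and each Hangul syllable block splits into its individual jamo, which is the same kind of operation applied to both scripts.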

Words aren't an equivalent count between languages either. English uses a lot of helper words, some other languages use multiple suffixes. Chinese characters don't even make it clear where "word" boundaries are -- there are no spaces.
Chinese does make it explicit where word boundaries are.

The only language that doesn't is Thai, but there are still well-documented algorithms for it.

Really, only Thai? Is there a reference for that? A quick search suggests it’s not the case, but I’m no expert.

As a lowly beginner I find the lack of word boundaries in Thai frustrating, but I think it’s just that I have not yet learned to think in syllables. I’m still always sounding them out in my head until I have a word I recognize; there’s no flow.

This seems like something the LLMs should be very good at. Google Translate does OK-ish while Apple just throws up its hands in frustration and refuses to translate Thai texts.

Read the Unicode standard, it covers all of these things.
How does it make it explicit? You need a dictionary to figure it out, no? Same as e.g. Japanese?
Right but such dictionaries are already built in to all major operating systems. The double-click-to-select-word interaction works well with Chinese and Japanese in all major operating systems. Without such dictionaries you can't even implement word selection.
It works until it recognizes 外国人参政権 as foreign/carrot/regime
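A rough sketch of why a dictionary alone doesn't settle it: enumerate every way to cut the string into dictionary entries. The five-entry `TOY_DICT` is an assumption picked for this example, and `all_segmentations` is a hypothetical helper; real segmenters add frequencies or a language model to choose between candidates.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into entries of `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this head and segment the remainder recursively.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical mini dictionary; every entry is a real Japanese word.
TOY_DICT = {"外国", "外国人", "人参", "参政権", "政権"}

for split in all_segmentations("外国人参政権", TOY_DICT):
    print(" / ".join(split))
```

Both 外国 / 人参 / 政権 (foreign / carrot / regime) and 外国人 / 参政権 (foreigner / suffrage) are valid dictionary splits, so something beyond the word list has to pick the intended one.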
It's more like some big languages receive special treatment, while everything else is interpreted as a byte stream. In Finnish, the tokens seem to be arbitrary substrings of average length 3-4, and they rarely correspond to any semantically or grammatically meaningful units.
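For a sense of what "interpreted as a byte stream" costs, here is a small sketch comparing code points to UTF-8 bytes. It does not reproduce any real tokenizer; `utf8_cost` is a made-up helper, and it only shows the raw byte counts that a byte-level scheme would start from.

```python
def utf8_cost(text):
    """Return (number of code points, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))


# "hyv\u00e4\u00e4" is Finnish "hyvää" with precomposed a-umlauts.
for sample in ("hello", "hyv\u00e4\u00e4", "権"):
    print(sample, utf8_cost(sample))
```

ASCII text is one byte per character, each Finnish ä takes two bytes, and a single CJK character like 権 takes three, so text that falls through to bytes starts out longer before any merging happens.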