| HN Mirror



	by mgaunard 1170 days ago
	So for Latin-script languages, they tokenize per word, and somehow for Asian languages, it's tokenizing per radical. Of course you'd end up with a lot more tokens. Just tokenize by word regardless of language.

3 comments

k8si 1170 days ago

"word" isn't a useful concept in a lot of languages. Words are obvious in English because English is analytic: https://en.wikipedia.org/wiki/Analytic_language

But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English. E.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts because e.g. 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
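A minimal sketch of that sharing idea, assuming a tiny hand-written vocabulary and a greedy longest-match splitter (both `VOCAB` and `segment` are hypothetical, and real subword tokenizers learn their vocabularies from data rather than using a list like this):

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    word = word.lower()
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocab matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical vocabulary: four pieces cover three compounds.
VOCAB = {"schaden", "freude", "vor", "lebens"}

for w in ["Schadenfreude", "Vorfreude", "Lebensfreude"]:
    print(w, "->", segment(w, VOCAB))
```

All three compounds end in the same `freude` piece, so the vocabulary needs four entries instead of three whole-word entries that share nothing.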


mgaunard 1169 days ago

Of course most Indo-European languages have declensions or at least conjugation. That includes English, even if it is very simple there.

CJK languages do not really have that, they don't even have conjugation. They have simple suffixes at best to mark a verb as being interrogative.

Words do exist in all CJK languages and are meaningful. Tokenizing by Hangul radicals doesn't really make sense if you're not going to tokenize by decomposed letter in Romance languages too (e.g. hôtel would be h, o, ^, t, e, l).
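For illustration, the decomposition being compared here is roughly what Unicode canonical decomposition (NFD) produces. This is a sketch using Python's standard `unicodedata` module; it shows the code-point-level split only, and is not a claim about how any particular tokenizer actually treats Hangul (the helper name `decompose` is made up for this example):

```python
import unicodedata

def decompose(text):
    """Return the Unicode names of the code points in the NFD form of `text`."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", text)]

# "hôtel" splits into h, o, a combining circumflex, t, e, l.
for name in decompose("h\u00f4tel"):
    print(name)

# Each precomposed Hangul syllable splits into its individual jamo.
for name in decompose("\ud55c\uae00"):  # the two syllables of "Hangul"
    print(name)
```

Under NFD both cases behave alike: the accented letter becomes a base letter plus a combining mark, and each Hangul syllable block becomes its leading consonant, vowel, and optional trailing consonant.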


crazygringo 1170 days ago

Words aren't an equivalent count between languages either. English uses a lot of helper words, some other languages use multiple suffixes. Chinese characters don't even make it clear where "word" boundaries are -- there are no spaces.


mgaunard 1170 days ago

Chinese does make it explicit where word boundaries are.

The only language that doesn't is Thai, but there are still well-documented algorithms for it.


biztos 1170 days ago

Really, only Thai? Is there a reference for that? A quick search suggests it’s not the case, but I’m no expert.

As a lowly beginner I find the lack of word boundaries in Thai frustrating, but I think it’s just that I have not yet learned to think in syllables. I’m still always sounding them out in my head until I have a word I recognize, so there’s no flow.

This seems like something the LLMs should be very good at. Google Translate does OK-ish while Apple just throws up its hands in frustration and refuses to translate Thai texts.


mgaunard 1169 days ago

Read the Unicode standard, it covers all of these things.


crazygringo 1170 days ago

How does it make it explicit? You need a dictionary to figure it out, no? Same as e.g. Japanese?


kccqzy 1170 days ago

Right, but such dictionaries are already built into all major operating systems. The double-click-to-select-word interaction works well with Chinese and Japanese in all major operating systems. Without such dictionaries you can't even implement word selection.


fomine3 1170 days ago

It works until it recognizes 外国人参政権 as foreign/carrot/regime
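A small sketch of why that happens, assuming a toy dictionary (the entries in `DICTIONARY` and the function `all_segmentations` are hypothetical and not taken from any real OS word list). It enumerates every way to cover the string with dictionary words:

```python
def all_segmentations(text, dictionary):
    """Return every split of `text` into consecutive words from `dictionary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results

# Hypothetical dictionary entries with rough English glosses.
DICTIONARY = {
    "外国": "foreign country",
    "外国人": "foreigner",
    "人参": "carrot",
    "参政権": "suffrage",
    "政権": "regime",
}

for words in all_segmentations("外国人参政権", DICTIONARY):
    print(" / ".join(f"{w} ({DICTIONARY[w]})" for w in words))
```

With these entries there are two valid coverings: foreign country / carrot / regime, and foreigner / suffrage. The dictionary alone cannot say which one is meant, so a segmenter needs frequencies or context to pick the second.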


jltsiren 1170 days ago

It's more like some big languages receive special treatment, while everything else is interpreted as a byte stream. In Finnish, the tokens seem to be arbitrary substrings of average length 3-4, and they rarely correspond to any semantically or grammatically meaningful units.
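To make the "byte stream" picture concrete, here is a rough sketch of what a pure byte-level fallback would see, assuming one token per UTF-8 byte (the function `utf8_byte_tokens` is hypothetical; real tokenizers merge frequent byte sequences into longer tokens, which is where the 3-4 character substrings would come from):

```python
def utf8_byte_tokens(text):
    """Return one integer 'token' per UTF-8 byte of `text`."""
    return list(text.encode("utf-8"))

# Compare character count with byte count for a few sample strings.
for sample in ["hotel", "hyvää", "外国人"]:
    tokens = utf8_byte_tokens(sample)
    print(f"{sample}: {len(sample)} chars, {len(tokens)} bytes")
```

Plain ASCII costs one byte per character, Finnish ä costs two, and each CJK character costs three, so text that gets few learned merges ends up with many more tokens for the same amount of content.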
