by mgaunard 1171 days ago
Chinese does make it explicit where word boundaries are.

The only language that doesn't is Thai, but there are still well-documented algorithms for it.


Really, only Thai? Is there a reference for that? A quick search suggests it’s not the case, but I’m no expert.

As a lowly beginner, I find the lack of word boundaries in Thai frustrating, but I think it's just that I have not yet learned to think in syllables. I'm still sounding them out in my head until I have a word I recognize; there's no flow.

This seems like something the LLMs should be very good at. Google Translate does OK-ish while Apple just throws up its hands in frustration and refuses to translate Thai texts.

Read the Unicode standard; it covers all of these things.
How does it make it explicit? You need a dictionary to figure it out, no? Same as e.g. Japanese?
Right, but such dictionaries are already built into all major operating systems. Double-click-to-select-word works well with Chinese and Japanese on all of them. Without such dictionaries you can't even implement word selection.
It works until it recognizes 外国人参政権 (foreigners' voting rights) as foreign/carrot/regime.
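
To make the ambiguity concrete, here is a minimal Python sketch of dictionary-based segmentation. The five-word vocabulary and both function names are made up for illustration; this is not how any particular OS or ICU does it, just the simplest version of the idea. It shows that 外国人参政権 has two splits that are both fully covered by the dictionary, and that a greedy longest-match segmenter picks the "carrot" reading as soon as one entry is missing.

```python
# Toy vocabulary (an assumption for illustration, not a real dictionary).
TOY_VOCAB = {"外国", "外国人", "人参", "参政権", "政権"}


def segmentations(text, vocab):
    """Return every way to split text into words that all appear in vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


def greedy_longest_match(text, vocab):
    """Forward maximum matching: at each position take the longest known word."""
    words = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            # Fall back to a single character when nothing in vocab matches.
            if text[i:end] in vocab or end == i + 1:
                words.append(text[i:end])
                i = end
                break
    return words


text = "外国人参政権"

# Both readings are valid as far as the dictionary is concerned.
print(segmentations(text, TOY_VOCAB))
# [['外国', '人参', '政権'], ['外国人', '参政権']]

# Greedy matching gets it right only if the longer entry is present.
print(greedy_longest_match(text, TOY_VOCAB))
# ['外国人', '参政権']
print(greedy_longest_match(text, TOY_VOCAB - {"外国人"}))
# ['外国', '人参', '政権']
```

A dictionary alone only tells you which splits are possible. Choosing between them takes something extra, such as word frequencies or context, which is presumably where real segmenters differ from this sketch.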