Hacker News
by deadfoxygrandpa 1139 days ago
i think you're on to something about cantonese, but it's also true of mandarin. segmentation of words in chinese in general seems inherently messier than segmentation in english. also look at stuff like abbreviations: is 北大 one word? is it an abbreviation for 北京大学 the same way Caltech is an abbreviation for california institute of technology? is it just two single-character words, each of which is an abbreviation? i think it's much less clear than in english
2 comments

Segmentation in Mandarin is easier due to the tendency of the language to use 2+ characters for words. A high-quality wordlist will get you a long way.

The problem with proper nouns is that they don't end up in dictionaries. The same goes for slang and other terms that, for various reasons, never get listed.

The additional problem with Cantonese is that there's a larger class of words where the constituent characters can move around as if they were words themselves. Even for a native speaker with some experience in lexicography, it can be difficult to determine word boundaries, as there are many cases where a word with characters X+Y can be interpreted as just word X and word Y with some idiomatic meaning. This issue is more pronounced in Cantonese because there are more single-character words in active use.

I've actually done this before. My experience is that naive segmentation of Mandarin text with a wordlist is probably 80+% accurate, while running the same algorithm on Cantonese text (with a Cantonese wordlist) will definitely end up "wtf".
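
For anyone wondering what "naive segmentation with a wordlist" might look like, here is a rough sketch. I'm assuming greedy forward maximum matching, which is a common baseline but not necessarily the exact algorithm I used. The `segment` function, the `max_len` cutoff and the four-entry wordlist are all made up for illustration.

```python
def segment(text, wordlist, max_len=4):
    """Greedy forward maximum matching (an assumed baseline, not a reference
    implementation). At each position, take the longest wordlist entry that
    matches; if nothing matches, emit a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in wordlist:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy wordlist; a real one would have tens of thousands of entries.
wordlist = {"北京", "大学", "北京大学", "学生"}

print(segment("北京大学学生", wordlist))  # ['北京大学', '学生']
print(segment("北大学生", wordlist))      # ['北', '大学', '生']
```

The second call shows the abbreviation problem from the parent comment: 北大 is not in the toy wordlist, so the greedy match grabs 大学 and leaves 北 and 生 stranded. Text with lots of valid single-character words gives the matcher many more chances to go wrong like this.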

The same problem exists in Japanese FWIW, whose speakers like to make the same sorts of abbreviations despite not having a bisyllabic meter like Mandarin does. Japanese is somewhat helped by having multiple orthographies, however.