by yorwba 3174 days ago
Do you speak Chinese? NLP for Chinese comes with its own set of challenges.

Because word boundaries aren't marked, you need word segmentation, which in many cases requires understanding the text. (You can't tell where one word ends and the next begins if you don't know what they mean.)
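
To make the segmentation problem concrete, here is a minimal Python sketch of greedy forward maximum matching, which is one simple dictionary-based approach and not necessarily what any particular tool does. The lexicon is a made-up toy list, and `forward_max_match` is a hypothetical helper name.

```python
# Toy lexicon: an assumption for illustration, not a real dictionary.
LEXICON = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def forward_max_match(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy segmentation: at each position, take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


print(forward_max_match("研究生命起源"))
```

On 研究生命起源 ("to study the origin of life") the greedy matcher grabs 研究生 ("graduate student") first and returns 研究生 / 命 / 起源 instead of the intended 研究 / 生命 / 起源. Choosing the right split takes knowing which reading makes sense.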

Many words have a literal meaning and a metaphorical one (e.g. 纠结 can mean both "tangled" and "confused").

It's also common for several different synonyms to be used for the same concept, especially when you contrast formal and informal writing.

Misspellings happen too: a character gets replaced by a similar character that has the same pronunciation. Sometimes completely different characters are used as a kind of pun.

Grepping is less likely to yield false positives (unless you're looking for a single-character word that can appear in compounds), but there is no easy way to do fuzzy matching to account for misspelled words.
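
As a rough illustration of what fuzzy matching for same-pronunciation misspellings might involve, here is a Python sketch that compares text by sound instead of by character. Everything in it is an assumption: the `PINYIN` table is a tiny hand-written toneless mapping, and `sound_key` and `homophone_find` are hypothetical helpers. A real system would need a full pronunciation dictionary and some way to handle tones and characters with several readings.

```python
# Tiny hand-written toneless pinyin table: an assumption for illustration only.
PINYIN = {"再": "zai", "在": "zai", "见": "jian", "明": "ming", "天": "tian"}


def sound_key(text, table=PINYIN):
    """Map each character to its pinyin; unknown characters map to themselves."""
    return [table.get(ch, ch) for ch in text]


def homophone_find(text, query, table=PINYIN):
    """Return start indices where a window of text sounds the same as query."""
    target = sound_key(query, table)
    n = len(query)
    return [
        i
        for i in range(len(text) - n + 1)
        if sound_key(text[i:i + n], table) == target
    ]


text = "明天在见"  # 再见 ("see you") misspelled with the homophone 在
print("再见" in text)                # plain substring search misses it
print(homophone_find(text, "再见"))  # sound-based search finds the window
```

This only covers exact homophones. It does nothing for the pun case, where the substituted characters can be completely different, and it would over-match in practice because many unrelated characters share a pronunciation.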