by bllguo 3174 days ago
I'm sure there's research on this already, but just a thought: it seems to me that NLP would be easier on Chinese than on English (I speak both). It's much easier to recognize the author's meaning in Chinese characters, and more often than not people use the same words to describe something, vs. the frankly disgusting mess that is English. Misspellings abound, the same words can mean all kinds of things, grepping for strings is hard because sequences of letters aren't unique, etc.
1 comment

Do you speak Chinese? NLP for Chinese comes with its own set of challenges.

Because word boundaries aren't marked, you need word segmentation, which in many cases requires understanding the text. (You can't tell where one word ends and the next begins if you don't know what they mean.)
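
To make the segmentation problem concrete, here is a minimal sketch of greedy forward maximum matching, a simple dictionary-based approach (not necessarily what any real segmenter uses). The vocabulary, the function name `forward_max_match`, and the `max_len` parameter are all made up for illustration. The sentence is meant to read 研究 / 生命 / 起源 ("research the origin of life"), but the greedy matcher grabs 研究生 ("graduate student") first and strands 命.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position, take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


# Toy vocabulary, assumed for this example only.
VOCAB = {"研究", "研究生", "生命", "起源"}

segments = forward_max_match("研究生命起源", VOCAB)
print(segments)  # ['研究生', '命', '起源'], not the intended 研究 / 生命 / 起源
```

Picking the right split here takes knowing that "graduate student" makes no sense in this sentence, which a plain dictionary lookup can't tell you.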

Many words have a literal meaning and a metaphorical one (e.g. 纠结 can mean both "tangled" and "confused").

Using different synonyms for the same concept is common as well, especially when you contrast formal vs. informal writing.

Misspellings can happen too, where a character is substituted with a similar character that has the same pronunciation. Sometimes completely different characters are used as a kind of pun.

Grepping is less likely to yield false positives (unless you're looking for a single-character word that can appear in compounds), but there is no easy way to do fuzzy matching to account for misspelled words.
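
A rough sketch of both points, under heavy assumptions: the sample sentences and the tiny `READINGS` table are hand-written for this example, and `phonetic_key` and `homophone_find` are hypothetical helpers, not functions from any library. The first part shows a single-character search matching an unrelated use of the character. The second shows an exact search missing a homophone misspelling (在见 written for 再见), and one partial workaround that compares pronunciations instead of characters.

```python
# Part 1: a single-character search also matches unrelated uses.
texts = ["我买了一束花", "这次旅行花了很多钱"]  # "a bunch of flowers" / "spent a lot of money"
hits = [t for t in texts if "花" in t]
print(len(hits))  # 2: the second hit uses 花 as "to spend", not "flower"

# Part 2: homophone misspellings defeat exact matching.
# Toy pronunciation table (assumption); a real one would need to cover
# thousands of characters and handle characters with several readings.
READINGS = {"再": "zai4", "在": "zai4", "见": "jian4", "明": "ming2", "天": "tian1"}


def phonetic_key(text):
    """Map each character to its reading; unknown characters map to themselves."""
    return tuple(READINGS.get(ch, ch) for ch in text)


def homophone_find(needle, haystack):
    """Return start offsets where haystack sounds like needle."""
    key = phonetic_key(needle)
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1)
            if phonetic_key(haystack[i:i + n]) == key]


sentence = "明天在见"  # misspelling of 明天再见
exact = sentence.find("再见")
fuzzy = homophone_find("再见", sentence)
print(exact, fuzzy)  # -1 [2]
```

Even this only catches same-pronunciation substitutions. Visually similar characters and deliberate puns would each need their own lookup data, which is why it's nothing like edit-distance matching on letters.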