by marshallbananas 1689 days ago
I spent a few years developing a Japanese-English dictionary that had searchable example sentences. Full-text indexing for Japanese is a nightmare. I used MeCab, Kumon, and Kuromoji for morphological analysis and tokenization; you should check them out. I played a bit with Chinese and it was relatively easy (compared to Japanese). Korean, I suppose, is somewhere in between those two.
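
To give a feel for why this is painful, here is a minimal sketch of the usual dependency-free fallback: indexing overlapping character bigrams instead of words. This is not what MeCab or Kuromoji do (they segment text using a dictionary), and the helper names `bigrams`, `build_index`, and `search` are made up for illustration. The two sample sentences are my own, not from the dictionary project.

```python
from collections import defaultdict

def bigrams(text):
    """Return overlapping 2-character chunks of text."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(sentences):
    """Map each bigram to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for gram in bigrams(sentence):
            index[gram].add(sid)
    return index

def search(index, sentences, query):
    """Return sentences containing the query as a raw substring."""
    grams = bigrams(query)
    if not grams:
        return []
    # Candidates must contain every bigram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Bigram hits need not be contiguous, so confirm with a substring check.
    return [sentences[i] for i in sorted(candidates) if query in sentences[i]]

sentences = [
    "私は東京都に住んでいます",  # "I live in Tokyo Metropolis"
    "京都は古い町です",          # "Kyoto is an old town"
]

# Whitespace splitting is useless here: the whole sentence is one "word".
whitespace_tokens = sentences[0].split()

index = build_index(sentences)
kyoto_hits = search(index, sentences, "京都")  # Kyoto
tokyo_hits = search(index, sentences, "東京")  # Tokyo
```

Searching for 京都 (Kyoto) returns both sentences, because 東京都 (Tokyo Metropolis) happens to contain the same two characters. Avoiding that kind of false positive is what the morphological analyzers are for, and it is where they struggle once the text gets casual.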

AFAIK there is nothing out there for East Asian languages that works as well as the equivalent tools for languages written in the Latin alphabet. The existing tools work pretty OK on textbook material with perfect grammar and easy kanji. They fall apart completely on casual human text/speech.

Do not attempt to solve this problem yourself! I'm guessing only the likes of Google and ML experts will be able to tackle this.