by chch 2230 days ago
Digging a bit deeper into the ICU code, it looks like the source for the break engine (used for Chinese, Japanese, and Korean) is here: https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ea...

The LICENSE file[1] describes where the dictionary comes from:

   #  The word list in cjdict.txt are generated by combining three word lists
   # listed below with further processing for compound word breaking. The
   # frequency is generated with an iterative training against Google web
   # corpora.
   #
   #  * Libtabe (Chinese)
   #    - https://sourceforge.net/project/?group_id=1519
   #    - Its license terms and conditions are shown below.
   #
   #  * IPADIC (Japanese)
   #    - http://chasen.aist-nara.ac.jp/chasen/distribution.html
   #    - Its license terms and conditions are shown below.
   #

It's interesting to see some of the other techniques used in that engine, such as a special function to figure out the weights of potential katakana word splits.
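
To make the idea concrete, here is a rough sketch of frequency-weighted dictionary segmentation with a separate cost for katakana runs. This is not ICU's code: the toy dictionary, the `UNKNOWN_COST` value, and the length-based `katakana_cost` formula are all made up for illustration. It just picks the split with the lowest total cost using dynamic programming.

```python
import math

# Hypothetical toy dictionary: word -> frequency count (not ICU's cjdict data).
FREQ = {"東京": 50, "都": 30, "東": 10, "京都": 40, "に": 80, "行く": 60}
TOTAL = sum(FREQ.values())
MAX_WORD_LEN = max(len(w) for w in FREQ)
UNKNOWN_COST = 20.0  # assumed penalty for a lone character not in the dictionary


def is_katakana(ch: str) -> bool:
    # Katakana block U+30A0..U+30FF (includes the long-vowel mark U+30FC).
    return "\u30a0" <= ch <= "\u30ff"


def katakana_cost(length: int) -> float:
    # Hypothetical weighting: cheapest around 4 characters, pricier for
    # very short or very long pieces. Not the formula ICU uses.
    return 5.0 + 2.0 * abs(length - 4)


def segment(text: str) -> list[str]:
    n = len(text)
    best = [math.inf] * (n + 1)  # best[j] = lowest cost to segment text[:j]
    back = [0] * (n + 1)         # back[j] = start index of the last word
    best[0] = 0.0
    for i in range(n):
        # Fallback: a single unknown character is always allowed.
        candidates = [(i + 1, UNKNOWN_COST)]
        # Dictionary words starting at i, costed by negative log frequency.
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            word = text[i:j]
            if word in FREQ:
                candidates.append((j, -math.log(FREQ[word] / TOTAL)))
        # Every prefix of the katakana run starting at i is a possible word.
        j = i
        while j < n and is_katakana(text[j]):
            j += 1
            candidates.append((j, katakana_cost(j - i)))
        for end, cost in candidates:
            if best[i] + cost < best[end]:
                best[end] = best[i] + cost
                back[end] = i
    # Walk the back-pointers from the end to recover the words.
    words = []
    j = n
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]


print(segment("東京都に行く"))  # ['東京', '都', 'に', '行く']
print(segment("カメラに"))      # ['カメラ', 'に']
```

In the first example "東京" + "都" beats "東" + "京都" purely on the toy frequencies, and in the second the katakana run "カメラ" is kept whole because one 3-character piece is cheaper than any split under the assumed cost function.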

[1] https://github.com/unicode-org/icu/blob/6417a3b720d8ae3643f7...