by chch
2230 days ago
Digging a bit deeper into the ICU code, it looks like the source for the break engine (used for Chinese, Japanese, and Korean) is here:
https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ea...

According to the LICENSE file[1], the dictionary is built from the following sources:

# The word list in cjdict.txt are generated by combining three word lists
# listed below with further processing for compound word breaking. The
# frequency is generated with an iterative training against Google web
# corpora.
#
# * Libtabe (Chinese)
# - https://sourceforge.net/project/?group_id=1519
# - Its license terms and conditions are shown below.
#
# * IPADIC (Japanese)
# - http://chasen.aist-nara.ac.jp/chasen/distribution.html
# - Its license terms and conditions are shown below.
#

It's interesting to see some of the other techniques used in that engine, such as a special function to figure out the weights of potential katakana word splits.

[1] https://github.com/unicode-org/icu/blob/6417a3b720d8ae3643f7...
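
To make the katakana point a bit more concrete, here is a rough Python sketch of how a dictionary-based segmenter can weigh a katakana run against dictionary words. This is not ICU's code: the cost table, the `segment` and `katakana_cost` names, and the toy dictionary costs are all made up for illustration. The only idea taken from the engine is that an unknown katakana run gets a cost that depends on its length, and the segmenter picks the cheapest overall split.

```python
import math

# Hypothetical cost table for an unknown katakana run, indexed by run length.
# Illustrative numbers only: very short and very long runs are penalized most.
KATAKANA_COST = [8192, 984, 408, 240, 204, 252, 300, 372, 480]
MAX_KATAKANA_LEN = len(KATAKANA_COST) - 1
UNKNOWN_COST = 8192  # assumed cost of emitting one unknown character as a word


def is_katakana(ch):
    # Katakana letters and the prolonged sound mark, excluding the middle dot.
    return "\u30a1" <= ch <= "\u30fe" and ch != "\u30fb"


def katakana_cost(length):
    # Runs longer than the table get the same cost as an unknown character.
    return KATAKANA_COST[length] if length <= MAX_KATAKANA_LEN else UNKNOWN_COST


def segment(text, dictionary):
    """Minimum-cost split; `dictionary` maps word -> cost (lower = more frequent)."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i] = cheapest cost to segment text[:i]
    prev = [0] * (n + 1)         # prev[i] = start of the last word ending at i
    best[0] = 0
    max_word = max(map(len, dictionary), default=1)

    for i in range(n):
        # Fallback: a single unknown character is always allowed.
        candidates = [(i + 1, UNKNOWN_COST)]
        # Dictionary words starting at position i.
        for j in range(i + 1, min(n, i + max_word) + 1):
            word = text[i:j]
            if word in dictionary:
                candidates.append((j, dictionary[word]))
        # A whole katakana run, considered only from the start of the run.
        if is_katakana(text[i]) and (i == 0 or not is_katakana(text[i - 1])):
            j = i
            while j < n and is_katakana(text[j]):
                j += 1
            candidates.append((j, katakana_cost(j - i)))

        for end, cost in candidates:
            if best[i] + cost < best[end]:
                best[end] = best[i] + cost
                prev[end] = i

    # Walk back through prev[] to recover the chosen words.
    words = []
    k = n
    while k > 0:
        words.append(text[prev[k]:k])
        k = prev[k]
    return words[::-1]
```

Calling `segment("東京にコンピュータ行く", {"東京": 100, "に": 50, "行く": 120})` keeps the katakana run together as one word even though it is not in the toy dictionary, because one length-based cost is much cheaper than six unknown characters.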