by imron
492 days ago
Nice work OP. I’ve done a fair amount of Chinese language segmentation programming, and yeah, it’s not easy, especially as you reach for higher levels of accuracy. You need to put in a significant amount of effort for gains of less than a few percentage points in accuracy. For my own tools, which focus on speed (and are used for finding frequently used words in large bodies of text), I ended up opting for a first-longest-match algorithm. It has a relatively high error rate, but that’s acceptable if you’re only looking for the first few hundred frequently used words. What segmenter are you using, or have you developed your own?
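For anyone unfamiliar with the approach, here is a minimal sketch of greedy forward longest-match segmentation followed by a frequency count. It is not my actual tool: the function names, the `max_word_len` parameter, and the tiny vocabulary are all made up for illustration, and a real dictionary would be far larger.

```python
from collections import Counter


def longest_match_segment(text, vocabulary, max_word_len=4):
    """Greedy forward longest-match: at each position take the longest
    dictionary word that starts there, else fall back to one character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


def top_words(text, vocabulary, n=300):
    """Segment the text and return the n most frequent tokens with counts."""
    return Counter(longest_match_segment(text, vocabulary)).most_common(n)


# Hypothetical toy vocabulary, chosen to show a typical greedy mistake.
vocab = {"研究", "研究生", "生命", "起源"}
tokens = longest_match_segment("研究生命起源", vocab)
# tokens == ["研究生", "命", "起源"]; the intended reading is 研究 / 生命 / 起源
```

The example shows where the error rate comes from: the greedy step grabs 研究生 and strands 命. Mistakes like this tend to matter less when you only care about the most frequent words across a large corpus.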
I'm using Jieba[0] because it hits a nice balance of speed and accuracy. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation processing. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.
[0] https://github.com/fxsjy/jieba
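To give a rough idea of what one such post-segmentation layer could look like, here is a simplified sketch of the chengyu merge, not the real code. It assumes the segmenter has already produced a list of tokens and that a set of known chengyu is available; `merge_chengyu`, `max_parts`, and the sample tokens are hypothetical names and data for illustration only.

```python
def merge_chengyu(tokens, chengyu, max_parts=4):
    """Rejoin adjacent tokens whose concatenation is a known chengyu.

    tokens:    segmenter output, a list of strings
    chengyu:   set of idioms that should be shown as a single word
    max_parts: most adjacent tokens to try joining (an assumed limit)
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest run of adjacent tokens first, down to pairs.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in chengyu:
                merged.append(candidate)
                i += n
                break
        else:
            # No run starting here forms a chengyu; keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical segmenter output where an idiom was split into two words.
tokens = ["他", "一心", "一意", "学习"]
merged = merge_chengyu(tokens, {"一心一意"})
# merged == ["他", "一心一意", "学习"]
```

Keeping this as a separate pass over the token list means it works no matter which segmenter produced the tokens, and more heuristic layers can be chained after it in the same way.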