by Hadjimina
865 days ago
Thanks! The tricky bit is making this work in languages that don't use spaces to separate words, such as Chinese. We should implement a real Chinese word segmenter there to chunk the words. Not sure if you saw it, but we already have pinyin in there: if you open the settings and tick "show pronunciations", it appears above the chat messages.
Or find all substrings that are listed in a dictionary (nearly everyone uses CC-CEDICT: https://www.mdbg.net/chinese/dictionary?page=cc-cedict ) and give translations for all of them. That way, the user isn't limited to any particular chunking granularity, which is always a finicky aspect of word segmenters to tune.
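
To make the idea concrete, here's a rough Python sketch of the "all dictionary substrings" lookup. The function name, the `max_len` parameter, and the tiny hand-written dictionary are all made up for illustration; a real version would load the actual CC-CEDICT file and probably use a trie, but the brute-force scan shows the principle.

```python
def find_dictionary_substrings(text, dictionary, max_len=None):
    """Return (start_index, substring, translation) for every substring
    of `text` that is a key in `dictionary` (a plain dict here)."""
    if max_len is None:
        # No entry can be longer than the longest dictionary key.
        max_len = max((len(word) for word in dictionary), default=0)

    matches = []
    for start in range(len(text)):
        # Try every candidate length from 1 up to max_len at this position.
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            candidate = text[start:end]
            if candidate in dictionary:
                matches.append((start, candidate, dictionary[candidate]))
    return matches


# Toy stand-in for CC-CEDICT (hypothetical entries with simplified glosses).
toy_dictionary = {
    "北": "north",
    "京": "capital",
    "北京": "Beijing",
    "大": "big",
    "学": "to study",
    "大学": "university",
    "北京大学": "Peking University",
}

for start, word, gloss in find_dictionary_substrings("北京大学", toy_dictionary):
    print(start, word, gloss)
```

Overlapping hits like 北京, 大学 and 北京大学 all come back, so the UI can decide which granularity to show (or show all of them) instead of committing to one segmentation.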