by closed
1454 days ago
As a person learning Cantonese, pycantonese and cantonese.sheik.co.uk were so helpful! The trickiest thing to explain to people is that Cantonese exists in a diglossia. It's spoken, but to write you would use Mandarin. So there aren't a ton of giant written-language corpora to train on. AFAICT that's why tools like this often use simple heuristics to do word segmentation. I remember going down a deep rabbit hole before finally just packaging a small tool to do Cantonese word segmentation using the A* algorithm: https://github.com/machow/cantocut
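For anyone curious what A*-based segmentation can look like, here is a minimal sketch. It is not cantocut's actual code: the toy lexicon, the cost values, and the names `LEXICON`, `UNKNOWN_COST`, and `segment` are all made up for illustration. The idea is to treat each character position as a graph node, each dictionary word as an edge with a cost, and let A* find the cheapest path from the start of the string to the end.

```python
import heapq

# Hypothetical toy lexicon: word -> cost (lower means more preferred).
# A real tool would derive costs from corpus frequencies.
LEXICON = {
    "我": 2.0, "哋": 2.5, "我哋": 1.5,
    "今日": 1.5, "去": 2.0,
    "飲": 2.0, "茶": 2.0, "飲茶": 1.5,
}
UNKNOWN_COST = 5.0  # assumed penalty for a single out-of-lexicon character


def segment(text, lexicon=LEXICON, unknown_cost=UNKNOWN_COST):
    """Split text into words by A* search over character positions."""
    n = len(text)
    if n == 0:
        return []
    max_len = max(map(len, lexicon), default=1)
    # Cheapest possible cost per character. The heuristic
    # (remaining chars * per_char) never overestimates, so it is admissible.
    per_char = min([c / len(w) for w, c in lexicon.items()] + [unknown_cost])

    # Frontier entries: (estimated total, cost so far, position, words so far)
    frontier = [(n * per_char, 0.0, 0, [])]
    best = {0: 0.0}  # cheapest known cost to reach each position
    while frontier:
        _, cost, pos, words = heapq.heappop(frontier)
        if pos == n:
            return words
        if cost > best.get(pos, float("inf")):
            continue  # stale entry; a cheaper route to pos was found
        for end in range(pos + 1, min(n, pos + max_len) + 1):
            piece = text[pos:end]
            if piece in lexicon:
                step = lexicon[piece]
            elif end == pos + 1:
                step = unknown_cost  # fall back to a single unknown character
            else:
                continue
            new_cost = cost + step
            if new_cost < best.get(end, float("inf")):
                best[end] = new_cost
                estimate = new_cost + (n - end) * per_char
                heapq.heappush(frontier, (estimate, new_cost, end, words + [piece]))
    return []


print(segment("我哋今日去飲茶"))  # ['我哋', '今日', '去', '飲茶']
```

With these made-up costs, the two-character words 我哋 and 飲茶 win over their single-character splits because one edge of cost 1.5 is cheaper than two edges of cost 2.0 or more. The single-character fallback guarantees that a path always exists, even for text the lexicon has never seen.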
|
A lot of us who speak Cantonese as our mother tongue use "Cantonese" as a written language on online forums and media. Written Mandarin normally appears in the news or in more formal settings.