by closed 1454 days ago
As a person learning Cantonese, I found pycantonese and cantonese.sheik.co.uk so helpful!

The trickiest thing to explain to people is that Cantonese is a case of diglossia. It's spoken, but to write you would use Mandarin. So there aren't a ton of giant written-language corpora to train on.

AFAICT that's why tools like this often use simple heuristics to do word segmentation. I remember going down a deep rabbit hole, before finally just packaging a small tool to do Cantonese word segmentation using the A* algorithm:

https://github.com/machow/cantocut
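
For anyone curious what A* segmentation can look like, here is a minimal sketch. It is not cantocut's actual code, just one way to frame the problem: nodes are character positions, each dictionary word is an edge of cost 1, and an unknown single character is an edge with a higher cost. The names segment, vocab, and unknown_cost are hypothetical, and the cost scheme is an assumption chosen for illustration.

```python
import heapq
from math import ceil


def segment(text, vocab, unknown_cost=2):
    """Split text into words with A* search over character positions.

    Assumed cost model (illustrative only): a dictionary word costs 1,
    a single character not in the dictionary costs unknown_cost.
    """
    n = len(text)
    if n == 0:
        return []
    max_len = max([len(w) for w in vocab] + [1])

    def h(i):
        # Admissible lower bound: every step costs at least 1 and
        # covers at most max_len characters.
        return ceil((n - i) / max_len)

    # Heap entries are (f = g + h, g, position, words so far).
    heap = [(h(0), 0, 0, ())]
    best = {0: 0}  # cheapest known cost to reach each position
    while heap:
        _, g, i, words = heapq.heappop(heap)
        if i == n:
            return list(words)
        if g > best.get(i, float("inf")):
            continue  # stale entry, a cheaper path to i was found
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = text[i:j]
            if piece in vocab:
                cost = 1
            elif j == i + 1:
                cost = unknown_cost  # fall back to a lone character
            else:
                continue
            g2 = g + cost
            if g2 < best.get(j, float("inf")):
                best[j] = g2
                heapq.heappush(heap, (g2 + h(j), g2, j, words + (piece,)))
    return []


vocab = {"我", "鍾意", "食", "點心", "點"}
print(segment("我鍾意食點心", vocab))  # ['我', '鍾意', '食', '點心']
```

With this cost model the search prefers the segmentation with the fewest pieces, so 點心 wins over 點 followed by an unknown 心. A real tool would likely weight words by frequency instead of a flat cost.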

3 comments

Not 100% accurate.

A lot of us who speak Cantonese as our mother tongue use "Cantonese" as a written language on online forums and media. Written Mandarin normally appears in the news or in more formal settings.

I thought the same thing as you, but spurred by the other commenters, found this: https://en.wikipedia.org/wiki/Written_Cantonese
This is a minor point, but I'm Cantonese and aware it can be written. (This is why I linked to a tool to segment written Cantonese :)
Cantonese can be written; it's just that the characters are not as common as those you see in Mandarin.