by nereye 1164 days ago
> Korean has couple of dozen of symbols in its alphabet.

While that is true (14 consonants, 10 vowels [0]), there are encodings for Korean that work at the syllable level, where each syllable consists of a leading consonant, a vowel, and an optional trailing consonant. That gives over 10,000 possible syllables (Unicode lists 11172 code points for them, see [1]).

[0] In practice there are more, both to cater for modern and obsolete forms and to distinguish forms by position, i.e. with separate encodings for leading vs. trailing consonants, etc.

[1] https://en.wikipedia.org/wiki/Hangul_Syllables
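
For anyone wondering where 11172 comes from: a minimal sketch, assuming the constants from the Unicode Standard's Hangul syllable composition arithmetic (19 leading consonants, 21 vowels, 28 trailing slots including "none", block starting at U+AC00). The function name `compose` and the index arguments are made up for illustration.

```python
# Sketch of Unicode's arithmetic layout for precomposed Hangul syllables.
# Constants follow the Unicode Standard; `compose` is a hypothetical helper name.
S_BASE = 0xAC00                          # first precomposed syllable, U+AC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"

SYLLABLE_COUNT = L_COUNT * V_COUNT * T_COUNT  # 19 * 21 * 28 = 11172


def compose(l_index, v_index, t_index=0):
    """Return the precomposed syllable for zero-based jamo indices."""
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)
```

The counts 19 and 21 are larger than the 14 and 10 basic letters because they include doubled consonants and compound vowels as separate entries.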

1 comments

In a bizarre coincidence, I've just been working on code handling Korean cluster breaks, and while it's true there are a lot of code points, the rules for handling them are mathematically trivial when you work with the code point values directly.
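
To illustrate the "mathematically trivial" part, here is a rough sketch (not the code mentioned above) of how the Hangul portion of grapheme cluster breaking can be done with plain arithmetic. It assumes the Hangul rules GB6 to GB8 from UAX #29 and the standard jamo ranges; `hangul_type` and `hangul_no_break` are hypothetical names, and all the non-Hangul rules (combining marks, ZWJ, and so on) are left out.

```python
# Hangul-only slice of grapheme cluster breaking, as a sketch.
# Function names are illustrative; other UAX #29 rules are not handled.
S_BASE, S_COUNT, T_COUNT = 0xAC00, 11172, 28


def hangul_type(ch):
    """Classify a character as L, V, T, LV, LVT, or None (not Hangul)."""
    cp = ord(ch)
    if S_BASE <= cp < S_BASE + S_COUNT:
        # A precomposed syllable has no trailing consonant exactly when
        # its offset into the block is a multiple of 28.
        return "LV" if (cp - S_BASE) % T_COUNT == 0 else "LVT"
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"   # leading consonant jamo
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"   # vowel jamo
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"   # trailing consonant jamo
    return None


# For each type on the left, the types that may follow without a break.
NO_BREAK_AFTER = {
    "L": {"L", "V", "LV", "LVT"},   # GB6
    "LV": {"V", "T"},               # GB7
    "V": {"V", "T"},                # GB7
    "LVT": {"T"},                   # GB8
    "T": {"T"},                     # GB8
}


def hangul_no_break(a, b):
    """True if the Hangul rules forbid a cluster break between a and b."""
    return hangul_type(b) in NO_BREAK_AFTER.get(hangul_type(a), ())
```

So the whole 11172-entry block collapses to one range check and one modulo, and the rest is a five-row table.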

(But I guess I also won't be surprised if the OpenAI guys can't write algorithms worth spit if it's not a large matrix multiplication.)