
by ygra 4787 days ago

In that case you “only” have to know what you're actually after: either a grapheme (a character in the human sense) or a code point (a character in the Unicode sense). Well, and then there is the code unit, which can be a “character” in the programming language sense, but it's best not to go there unless you want to fall into plenty of traps.

As long as you only wander around one of those levels (grapheme, code point, code unit, byte), all is (fairly) easy, but once you deal with multiple levels, mistakes almost invariably creep in and you start treating code points as graphemes or code units as code points, etc. Fun source of all kinds of bugs :-)
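
To make the four levels concrete, here is a rough Python sketch that counts the same string at each of them. The `levels` function is a made-up helper for illustration, and the grapheme count is only an approximation (it just skips combining marks); proper grapheme segmentation follows the Unicode text segmentation rules and handles many more cases, such as emoji sequences and Hangul jamo.

```python
import unicodedata


def levels(text: str) -> dict:
    """Count one string at each level: grapheme, code point, code unit, byte."""
    return {
        # Approximation only: treat every non-combining code point as the
        # start of a grapheme. Real segmentation is considerably more involved.
        "graphemes_approx": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
        # Python 3 strings are sequences of code points.
        "code_points": len(text),
        # UTF-16 code units are 2 bytes each (what Java or JavaScript call a "char").
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Bytes in the UTF-8 encoding.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
print(levels("e\u0301"))
# U+1F600, a code point outside the Basic Multilingual Plane.
print(levels("\U0001F600"))
```

The first string is one grapheme but two code points and three UTF-8 bytes; the second is one code point but two UTF-16 code units. Mixing up any two of those counts is exactly the kind of bug described above.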

So yes, text in general is hard. And Han, Hangul and the Japanese scripts are probably among the easiest scripts to support in software :-)