by BorgHunter
4787 days ago
Even this definition can get hairy, though. What is a character? Is 'á' one character or two? Most human beings would say one, but in actuality I formed it with an 'a' (U+0061) and a combining acute accent (U+0301): two separate code points. You can also get the same result with the single precomposed code point 'á' (U+00E1), but not every combining sequence has a precomposed equivalent. In the past, I've had to deal with horrible mashups of fixed-byte-length columns in flat text files with UTF-8 bolted onto them. In Java, no less. Trying to figure out how to deal with all the edge cases (how do you truncate a string when the boundary falls between a "normal" character and a combining character?) was an endless parade of the bizarre. Strings are hard, fundamentally.
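To make the truncation question concrete, here is a rough Java sketch, not the code from that project. It assumes the column limit is counted in UTF-8 bytes and that `java.text.BreakIterator` is a good enough approximation of "what a human calls a character". The class name `GraphemeTruncate` and the method `truncateUtf8` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.Locale;

final class GraphemeTruncate {
    // Hypothetical helper: keep as many whole user-perceived characters as fit
    // in maxBytes of UTF-8, never cutting between a base character and its
    // combining marks.
    static String truncateUtf8(String s, int maxBytes) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int used = 0;   // UTF-8 bytes consumed so far
        int end = 0;    // index (in UTF-16 code units) of the last safe cut
        int start = it.first();
        for (int next = it.next(); next != BreakIterator.DONE; start = next, next = it.next()) {
            int size = s.substring(start, next).getBytes(StandardCharsets.UTF_8).length;
            if (used + size > maxBytes) {
                break;  // the next cluster does not fit as a whole
            }
            used += size;
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301";  // 'a' + combining acute accent
        String precomposed = "\u00E1";  // single precomposed code point

        // Same glyph on screen, different strings until normalized.
        System.out.println(decomposed.equals(precomposed)); // false
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(precomposed));                       // true

        // "x" is 1 byte, 'a' + U+0301 is 3 bytes, so a 3-byte limit keeps only "x".
        System.out.println(truncateUtf8("xa\u0301y", 3));   // x
    }
}
```

Normalizing to NFC before truncating shrinks the problem where a precomposed form exists, but it does not remove it, so the boundary check is still needed.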
As long as you stay on just one of those levels (grapheme, code point, code unit, byte), everything is (fairly) easy. Once you deal with several levels at once, though, mistakes almost invariably creep in: you start treating code points as graphemes, or code units as code points, and so on. Fun source of all kinds of bugs :-)
So yes, text in general is hard. And Han, Hangul and the Japanese scripts are probably among the easiest scripts to support in software :-)
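For what it's worth, a small Java sketch of how the same string gives a different "length" at each of the four levels. It assumes `BreakIterator` as the stand-in for grapheme segmentation, UTF-16 for code units (that is what a Java `String` holds), and UTF-8 for bytes. The class `TextLevels` and its method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

final class TextLevels {
    // Graphemes as approximated by BreakIterator's character instance.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Bytes in the UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' + combining acute accent, then U+1F600 written as a surrogate pair.
        String s = "a\u0301\uD83D\uDE00";
        System.out.println(graphemes(s));  // 2
        System.out.println(codePoints(s)); // 3
        System.out.println(codeUnits(s));  // 4
        System.out.println(utf8Bytes(s));  // 7
    }
}
```

Four different answers for one short string, which is exactly where the level-mixing bugs come from.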