by btn
4512 days ago
One of the major difficulties with Unicode handling is not just that there are poor implementations out there with legacy baggage, but that there is a lot of poor advice as well (or well-meaning advice that seems correct but misses some corner case or some language). For example, this article wants to count "graphemes", and the author goes through three versions of an algorithm to account for surrogate pairs and various combining marks. All seems well in the test cases the author shows, but combining marks are only one class of code points that can join to form a grapheme, and the algorithm will fail for other grapheme clusters such as 'நி' (Tamil letter NA + Tamil vowel sign I), for Hangul made of conjoining Jamo (such as '깍': 'ᄁ' + 'ᅡ' + 'ᆨ'), or for certain control characters.

Luckily, the Unicode Technical Committee has figured this out for you, and UAX #29 provides an algorithm for determining grapheme cluster boundaries [1]. Yes, it's long and technical, it has many cases (and exceptions) to handle, and it can't be expressed compactly in two lines of JavaScript; but it will give you a well-defined and well-understood answer for all scripts in Unicode.

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...
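For what it's worth, here is a minimal sketch of leaning on an existing UAX #29 implementation instead of hand-rolling one. It assumes a runtime that ships `Intl.Segmenter` (recent Node.js and browsers); `countGraphemes` is a name made up for illustration, not something from the article.

```javascript
// Sketch only: assumes Intl.Segmenter is available in the runtime.
// countGraphemes is a hypothetical helper name.
function countGraphemes(text) {
  // "grapheme" granularity asks for UAX #29 extended grapheme clusters.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count++;
  }
  return count;
}

const tamil = "\u0BA8\u0BBF";        // Tamil letter NA + Tamil vowel sign I
const hangul = "\u1101\u1161\u11A8"; // conjoining Jamo that render as one syllable

console.log(tamil.length, countGraphemes(tamil));   // 2 1
console.log(hangul.length, countGraphemes(hangul)); // 3 1
```

Both strings are several code points long, yet each is reported as a single cluster, which is exactly the case that a combining-marks-only approach misses.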
I have read dozens (hundreds?) of Unicode-related blog posts for many different languages, with long debates and discussions about the hurdles of counting graphemes, but they always forget to explain why one would need it; it's just assumed that it's important or interesting. This specific post just says: "Let's say you want to count the number of symbols in a given string, for example. How would you go about it?" and then goes into a multi-page explanation, which is even incomplete (as you correctly noticed).
In my programming work, I can't remember many cases in which it has been useful to count graphemes. I usually need to either:
1) count the number of bytes of the Unicode encoding I'm using / going to use, for the purpose of low-level stuff like buffers/sockets/memory/etc.
2) ask a graphics library to tell me how big the string will be on the screen, in pixels (with the given fonts, layout, hints, and whatnot).
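To make case 1 concrete, here is a rough sketch in JavaScript, assuming UTF-8 is the target encoding and that the standard `TextEncoder` global is available; `utf8ByteLength` is just an illustrative name.

```javascript
// Sketch only: assumes UTF-8 output and a runtime with the TextEncoder global.
// utf8ByteLength is a hypothetical helper name.
function utf8ByteLength(text) {
  // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
  return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength("abc"));          // 3
console.log(utf8ByteLength("\u0BA8\u0BBF")); // 6 (two code points, three bytes each)
```

This is the number that matters when sizing a buffer or a socket write, and it needs no grapheme logic at all.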
Counting graphemes only sounds useful for things like command-line terminals; e.g., if I were to write a command-line user interface (à la getopt()) that automatically word-wraps the text of the usage screen at the 80th column, I would need to count graphemes, in the unlikely event that I had to support Tamil or Korean for such a specialized case.
tl;dr: counting graphemes is a very complicated problem you probably don't ever need to solve.