| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by HeraldEmbar 2474 days ago

tl;dr: Unicode codepoints don't have a 1-to-1 relationship with characters that are actually displayed. This has always been the case (due to zero-width characters, accent modifiers that go after another character, Hangul etc.) but has recently got complicated by the use of ZWJ (zero-width joiner) to make emojis out of combinations of other emojis, modifiers for skin colour, and variation selectors. There is also stuff like flag being made out of two characters, e.g. flag_D + flag_E = German flag.

Your language's length function is probably just returning the number of unicode codepoints in the string. You need to a function that computes the number of 'extended grapheme clusters' if you want to get actually displayed characters. And if that function is out of date, it might not handle ZWJ and variation selectors properly, and still give you a value of 2 instead of 1. Make sure your libraries are up to date.

Also, if you are writing a command line tool, you need to use a library to work out how many 'columns' a string will occupy for stuff like word wrapping, truncation etc. Chinese and Japanese characters take up two columns, many characters take up 0 columns, and all the above (emoji crap) can also affect the column count.

In short the Unicode standard has gotten pretty confusing and messy!

1 comments

hsivonen 2474 days ago

> Your language's length function is probably just returning the number of unicode codepoints in the string.

The article didn't say that!

"Number of Unicode code points" in a string is ambiguous, because surrogates and astral characters both are code points, so it's ambiguous if a surrogate pair counts as two code points or one. (It unambiguously counts as two UTF-16 code units and as one Unicode scalar value.)

The article presented four kinds of programming language-reported string lengths:

1. Length is number of UTF-8 code units. 2. Length is number of UTF-16 code units. 3. Length is number of UTF-32 code units, which is the same as the number of Unicode scalar values. 4. Length is number of extended grapheme clusters.