by hsivonen 2476 days ago
> Your language's length function is probably just returning the number of unicode codepoints in the string.

The article didn't say that!

"Number of Unicode code points" in a string is ambiguous, because surrogates and astral characters are both code points, so it's unclear whether a surrogate pair counts as two code points or one. (It unambiguously counts as two UTF-16 code units and as one Unicode scalar value.)
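
To make the parenthetical concrete, here is a minimal Python sketch. The sample character (U+1F600) and the variable names are my own picks for illustration, not something from the article.

```python
import struct

# Illustrative astral character (outside the Basic Multilingual Plane): U+1F600.
ch = "\U0001F600"

# Encode as UTF-16 little-endian without a BOM; each code unit is 2 bytes.
utf16_bytes = ch.encode("utf-16-le")
utf16_code_units = len(utf16_bytes) // 2

# Unpack the two 16-bit code units: together they form the surrogate pair.
high, low = struct.unpack("<2H", utf16_bytes)

# Python's len() counts code points; with no lone surrogates in the string,
# that equals the number of Unicode scalar values.
scalar_values = len(ch)

print(utf16_code_units, hex(high), hex(low), scalar_values)
# prints: 2 0xd83d 0xde00 1
```

So the same character is two surrogate code points if you look at the UTF-16 representation, or one astral code point if you look at what it encodes.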

The article presented four kinds of programming language-reported string lengths:

1. Length is number of UTF-8 code units.
2. Length is number of UTF-16 code units.
3. Length is number of UTF-32 code units, which is the same as the number of Unicode scalar values.
4. Length is number of extended grapheme clusters.
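
A rough Python sketch of the first three lengths, assuming a sample string I chose for illustration (facepalm, skin tone modifier, zero-width joiner, male sign, variation selector). The variable names are hypothetical. As far as I know the Python standard library has no extended grapheme cluster segmenter, so the fourth length is left out; it needs a third-party segmentation library.

```python
# Illustrative string, written with escapes so the source stays ASCII:
# U+1F926 U+1F3FC U+200D U+2642 U+FE0F
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# 1. UTF-8 code units: one code unit per byte.
utf8_units = len(s.encode("utf-8"))

# 2. UTF-16 code units: two bytes per code unit (BOM-free little-endian).
utf16_units = len(s.encode("utf-16-le")) // 2

# 3. UTF-32 code units / Unicode scalar values. Python's len() counts code
#    points, which matches scalar values here because the string contains
#    no lone surrogates.
scalar_values = len(s)

print(utf8_units, utf16_units, scalar_values)
# prints: 17 7 5
```

A grapheme-aware library would be expected to report this string as a single extended grapheme cluster, though that depends on the Unicode version the library implements.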