by hsivonen
2476 days ago
> Your language's length function is probably just returning the number of unicode codepoints in the string.

The article didn't say that! "Number of Unicode code points" in a string is ambiguous, because surrogates and astral characters are both code points, so it's ambiguous whether a surrogate pair counts as two code points or one. (It unambiguously counts as two UTF-16 code units and as one Unicode scalar value.)

The article presented four kinds of programming language-reported string lengths:

1. Length is number of UTF-8 code units.
2. Length is number of UTF-16 code units.
3. Length is number of UTF-32 code units, which is the same as the number of Unicode scalar values.
4. Length is number of extended grapheme clusters.
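
To make the first three concrete, here is a rough Python sketch. The helper names are mine, not from the article, and it assumes the string is well-formed (no lone surrogates; `encode` raises on those). The fourth length needs a UAX #29 segmenter, which Python's standard library doesn't ship, so it is left out.

```python
# Hypothetical helpers illustrating lengths 1-3 from the list above.
# Assumes well-formed input: a str with lone surrogates fails to encode.

def utf8_len(s: str) -> int:
    """Number of UTF-8 code units (bytes)."""
    return len(s.encode("utf-8"))

def utf16_len(s: str) -> int:
    """Number of UTF-16 code units (2 bytes each; no BOM with the -le codec)."""
    return len(s.encode("utf-16-le")) // 2

def scalar_len(s: str) -> int:
    """Number of UTF-32 code units, i.e. Unicode scalar values."""
    return len(s.encode("utf-32-le")) // 4

# Facepalm + skin tone modifier + ZWJ + male sign + variation selector-16,
# written with escapes so the source file encoding doesn't matter.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(utf8_len(facepalm), utf16_len(facepalm), scalar_len(facepalm))
# prints: 17 7 5
```

The two astral characters take a surrogate pair each in UTF-16, which is how five scalar values become seven code units.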