I swear there should be some rule or law about how Unicode articles will inevitably muddle code units / code points / grapheme clusters / bytes together.
> String length is typically determined by counting codepoints.
> This means that surrogate pairs would count as two characters.
If you were counting code points, a surrogate pair would be 1. If it's two, you're counting code units.
> Combining multiple diacritics may be stacked over the same character. a + ̈ == ̈a, increasing length, while only producing a single character.
Not if you're counting code points or code units, which would both produce an answer of "2", and that's a great example of why you shouldn't count with either.
The dark blue on black in tables is next to invisible. And then to put that on white on the alternate rows is just eyeball murder.
> Since there are over 1.1 million UTF-8 glphys (sic)
UTF-8 glyphs *twitch*; aside from that, I'm really curious how they got that number. In some ways, a font has it easy; my understanding is that modern font formats can do one glyph for acute accent, one glyph for all the vowels/letters, and then compose the glyphs into arrangements for having them combined. (IDK if those are also "glyphs" to the font or not.) But it's less drawing, at least. OTOH, some characters have >1 appearance/"image", AIUI.
> String length is typically determined by counting codepoints.
That depends entirely on what "strings" you are talking about.
In C/Go/Rust/Ruby, char*/string/std::string::String/String is bytes.
In Java/JavaScript, java.lang.String/String is UTF-16 code units.
In Python 3, str is code points.
In Swift, String is extended grapheme clusters.
In Haskell, there are various different "string" types in common use.
And in C++, std::basic_string is a generic container for whatever element type you want. (std::string is the specialization for char, i.e. bytes.)
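To make the difference concrete, here's a rough Python sketch that counts the same string four ways. The helper name `lengths` is made up for illustration, and the grapheme count is only an approximation (it skips combining marks rather than implementing the real extended grapheme cluster rules, so it will miscount things like ZWJ emoji sequences or flag emoji).

```python
import unicodedata


def lengths(s: str) -> dict:
    """Count one string in four different units (hypothetical helper)."""
    return {
        # What C / Go / Rust / Ruby would report for UTF-8 encoded text.
        "utf8_bytes": len(s.encode("utf-8")),
        # What Java / JavaScript report: UTF-16 code units (2 bytes each).
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # What Python 3's len() reports: code points.
        "code_points": len(s),
        # ASSUMPTION: crude stand-in for grapheme clusters; it only skips
        # code points with a nonzero canonical combining class.
        "approx_graphemes": sum(
            1 for ch in s if unicodedata.combining(ch) == 0
        ),
    }


# "a" + COMBINING DIAERESIS (U+0308), then U+1F600, which needs a
# surrogate pair in UTF-16.
print(lengths("a\u0308\U0001F600"))
```

Same string, four different "lengths" (7, 4, 3, and 2 here), which is why "string length" means nothing until you say which unit you're counting.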
EDIT: Clarified that I don't disagree with parent comment; merely pointing out additional less-than-precise language.