by TeMPOraL
2863 days ago
> I've been under the guise for most of my career that doubling a digit leads to huge benefits that I'm too comp-sci ignorant to understand.

I was confused about this for years, too. But it turns out it's just a problem of bad naming. Happens more in this industry than we'd like to admit. As others explained, it boils down to UTF-16 being 16-bit, and UTF-8 being anything from 8- to 32-bit. It should have been named UTF-V (from "variable") or something, but here we are.

UTF-8 is a variable-length encoding using up to 4 code units (though it used to be up to 6, and could again be up to 6), each of which is 8 bits wide.
Both UTF-16 and UTF-8 are variable-length encodings!
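
To make the variable-length point concrete, here is a minimal Python sketch that counts code units per encoding for a few sample characters. The helper name `code_units` and the sample characters are just for illustration.

```python
def code_units(ch: str) -> dict:
    """Count how many code units one codepoint needs in each encoding."""
    return {
        # UTF-8 code units are 8 bits (1 byte) wide.
        "utf-8": len(ch.encode("utf-8")),
        # UTF-16 code units are 16 bits (2 bytes) wide; "-le" avoids a BOM.
        "utf-16": len(ch.encode("utf-16-le")) // 2,
        # UTF-32 code units are 32 bits (4 bytes) wide.
        "utf-32": len(ch.encode("utf-32-le")) // 4,
    }


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X}", code_units(ch))
```

Anything outside the Basic Multilingual Plane (the emoji here) takes two UTF-16 code units, a surrogate pair, so UTF-16 is no more fixed-width than UTF-8.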
UTF-32 is not variable-length, but even so, the way Unicode works, a character like á can be written in two different ways, one of which requires one codepoint and one of which requires two (regardless of encoding), while ṻ (LATIN SMALL LETTER U WITH MACRON AND DIAERESIS) can be written in up to five different ways requiring from one to three different codepoints (regardless of encoding).
Not every character has a one-codepoint representation in Unicode, or at least not every character has a canonically-pre-composed one-codepoint representation in Unicode.
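
A quick way to see this, sketched with Python's standard `unicodedata` module. The variable names are illustrative, and the list below only covers the spellings of ṻ that I am sure are canonically equivalent.

```python
import unicodedata

# Two spellings of "á": precomposed vs. base letter + combining acute accent.
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # "a" + COMBINING ACUTE ACCENT

# Several spellings of "ṻ" that normalize to the same character.
u_forms = [
    "\u1e7b",            # fully precomposed, 1 codepoint
    "\u016b\u0308",      # "ū" + COMBINING DIAERESIS, 2 codepoints
    "u\u0304\u0308",     # "u" + COMBINING MACRON + COMBINING DIAERESIS, 3 codepoints
]

if __name__ == "__main__":
    # The strings differ codepoint by codepoint...
    print(precomposed == decomposed)                                # False
    # ...but compare equal after normalizing to the same form.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    for form in u_forms:
        print(len(form), unicodedata.normalize("NFC", form) == "\u1e7b")
```

Comparing or indexing text without normalizing first means treating these spellings as different strings, even though a reader sees the same character.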
Therefore, many characters in Unicode can be expected to be written in multiple codepoints regardless of encoding, and so all programmers dealing with text need to be prepared for being unable to do an O(1) array index operation to get at the nth character of a string.
(In UTF-32 you can do an O(1) array index operation to get to the nth codepoint, not character, but one is usually only ever interested in getting the nth character.)
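
Here is a rough sketch of the difference between the nth codepoint and the nth character. Big assumption: "character" is approximated as a base codepoint plus any combining marks that follow it, which is much simpler than the real grapheme cluster rules in UAX #29 (it gets emoji sequences and Hangul jamo wrong, for instance). `characters` and `nth_character` are hypothetical helper names.

```python
import unicodedata


def characters(text: str):
    """Yield approximate user-perceived characters.

    Simplification: a character is a base codepoint followed by zero or
    more combining marks (nonzero canonical combining class).
    """
    cluster = ""
    for cp in text:
        if cluster and unicodedata.combining(cp) == 0:
            # A new base codepoint starts a new character.
            yield cluster
            cluster = cp
        else:
            cluster += cp
    if cluster:
        yield cluster


def nth_character(text: str, n: int) -> str:
    """Return the nth character by scanning from the start: O(n), not O(1)."""
    for i, ch in enumerate(characters(text)):
        if i == n:
            return ch
    raise IndexError("character index out of range")


if __name__ == "__main__":
    s = "u\u0304\u0308a\u0301z"     # three characters, six codepoints
    print(len(s))                   # codepoint count
    print(repr(s[1]))               # O(1) index: a lone combining macron
    print(repr(nth_character(s, 1)))  # linear scan: "a" + combining acute
```

Python strings index by codepoint, so `s[1]` behaves like the UTF-32 case: constant time, but it hands back a bare combining mark rather than anything a user would call a character.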