by panpog
343 days ago
It seems plausible that this could be made efficient to do byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with UCS-2 or UTF-32 in mind, not UTF-8. It's also not clear to me that the code point is a good abstraction in the design of UTF-8. Usually, what you want is either the byte or the grapheme cluster.
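For what it's worth, here is a rough sketch of how case mapping looks at the byte level in UTF-8 as it exists today. The helper name `utf8_hex` is made up for illustration, and the three sample letters are just convenient cases, not a survey of the whole codespace.

```python
def utf8_hex(s: str) -> str:
    """Return the UTF-8 bytes of s as space-separated hex (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in s.encode("utf-8"))

# Compare each lowercase letter with whatever Python's str.upper() maps it to.
for lower in ("é", "ÿ", "ß"):
    upper = lower.upper()
    print(f"{lower} [{utf8_hex(lower)}] -> {upper} [{utf8_hex(upper)}]")

# é (C3 A9) and É (C3 89) share a lead byte and differ only in the
# continuation byte, so a byte-wise rule would work for this pair.
# ÿ (C3 BF) uppercases to Ÿ (C5 B8), which changes the lead byte as well.
# ß uppercases to the two-letter string "SS", so even the length changes.
```

So some pairs already happen to be byte-friendly, but nothing in the current layout guarantees it.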
Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/
"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."
I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use cases beyond emoji. Of course I read the section saying it's used in actual languages, but the few examples described could have been handled with a dedicated 32-bit code point...)
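A small sketch of the quoted point, assuming the grapheme in question is the "factory worker" emoji (man + zero-width joiner + factory). That choice is my assumption; any ZWJ sequence would show the same thing. Python's `len()` counts code points, not grapheme clusters, and the standard library has no grapheme segmentation, so this only shows the code-point and byte counts.

```python
# One user-perceived character built from three code points:
# U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F3ED FACTORY
worker = "\U0001F468\u200D\U0001F3ED"

code_points = len(worker)                      # number of code points
utf32_bytes = len(worker.encode("utf-32-le"))  # 4 bytes per code point, no BOM
utf8_bytes = len(worker.encode("utf-8"))       # 4 + 3 + 4 bytes

print(code_points, utf32_bytes, utf8_bytes)    # 3 12 11
```

Even the fixed-width encoding needs three units here, which is why the code point doesn't line up with what a user would call a character.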