by jekub
4003 days ago
> see https://github.com/gavingroovygrover/utf88

This is the most stupid way to extend UTF-8 I've seen. The only acceptable solution is to remove the restriction to four bytes per sequence, which would make it easy to encode these code points while keeping all the advantages of UTF-8. Doing it their way adds an additional layer of encoding, and with it a lot of complexity and room for bugs. It was probably done for compatibility, but a lot of software will do bad things with these new "surrogate" pairs, so this solution is not really more compatible in practice. And updating software to handle UTF-8 sequences longer than 4 bytes is a lot easier than updating it to handle such an encoding.
I agree. Extending UTF-8 with surrogates like this is intended to be temporary, used only until the pre-2003 limit of 2.1 billion codepoints for UTF-8 and UTF-32 is reinstated by the Unicode Consortium. Then any software using UTF-88 can easily switch to the 1- to 6-byte sequences of "reinstated" UTF-8. This surrogation scheme is actually intended for UTF-16, as a second-tier surrogate scheme so that it can encode the same number of codepoints as UTF-8 and UTF-32. I wrote all this under "Rationale" at the bottom of the linked page; did you read that far?
Hopefully, though, UTF-16 will be on its way out when pre-2003 UTF-8 and UTF-32 are reinstated, so this surrogation scheme wouldn't even see much use there.
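
For anyone who hasn't seen the pre-2003 form, here is a rough sketch of what encoding with 1- to 6-byte sequences could look like, following the bit layout of the original 31-bit UTF-8 definition (RFC 2279). The function name `encode_utf8_pre2003` is made up for illustration and is not taken from the utf88 repository. The sketch does no surrogate or overlong checking, since an encoder that always picks the shortest form doesn't produce overlongs.

```python
def encode_utf8_pre2003(cp: int) -> bytes:
    """Encode one code point with the 1-6 byte layout of pre-2003 UTF-8.

    Illustrative helper (hypothetical name): accepts 0..0x7FFFFFFF and does
    not reject UTF-16 surrogate values.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])  # ASCII stays a single byte

    # (exclusive upper limit, sequence length, lead-byte prefix)
    layouts = (
        (0x800, 2, 0xC0),       # 110xxxxx
        (0x10000, 3, 0xE0),     # 1110xxxx
        (0x200000, 4, 0xF0),    # 11110xxx
        (0x4000000, 5, 0xF8),   # 111110xx
        (0x80000000, 6, 0xFC),  # 1111110x
    )
    for limit, length, prefix in layouts:
        if cp < limit:
            break

    out = bytearray(length)
    # Fill continuation bytes (10xxxxxx) from the end, 6 bits at a time.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    out[0] = prefix | cp  # remaining high bits go into the lead byte
    return bytes(out)
```

Up to U+10FFFF this gives the same bytes as today's UTF-8, and anything above that simply continues into 4-, 5- and 6-byte sequences, which is the compatibility argument made above.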