by vorg
4003 days ago
> The only acceptable solution is to remove the restriction of using only four bytes per sequence, which would allow encoding these easily while keeping all the advantages of UTF-8

I agree. Extending UTF-8 with surrogates like this is intended to be temporary, used only until the Unicode Consortium reinstates the pre-2003 limit of 2.1 billion codepoints for UTF-8 and UTF-32. Any software using UTF-88 can then easily swap its encoding to the 1- to 6-byte sequences of "reinstated" UTF-8. This surrogation scheme is actually intended for UTF-16, as a second-tier surrogate scheme that lets it encode the same number of codepoints as UTF-8 and UTF-32. I wrote all this under "Rationale" at the bottom of the linked page; did you read that far? Hopefully, though, UTF-16 will be on its way out by the time pre-2003 UTF-8 and UTF-32 are reinstated, so this surrogation scheme wouldn't see much use even there.
|
Why not go directly for the pre-2003 UTF-8 encoding? It would even put a bit of pressure on getting those longer sequences restored, and would show that this is the right way. I also think it is the only way to convince people to start implementing it.
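
For reference, here is a minimal sketch of what encoding with the pre-2003 1- to 6-byte forms (as defined in RFC 2279) might look like in Python. The function name `encode_utf8_pre2003` is made up for illustration, and the sketch makes no attempt to reject surrogate codepoints or handle decoding.

```python
def encode_utf8_pre2003(cp: int) -> bytes:
    """Encode one codepoint using the 1- to 6-byte UTF-8 forms of RFC 2279.

    Hypothetical helper for illustration; accepts any value up to 31 bits.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("codepoint outside the 31-bit range")
    if cp < 0x80:
        # Single-byte form: plain ASCII.
        return bytes([cp])

    # (exclusive upper limit, lead-byte marker, number of continuation bytes)
    forms = [
        (0x800, 0xC0, 1),       # 2 bytes, 11 payload bits
        (0x10000, 0xE0, 2),     # 3 bytes, 16 payload bits
        (0x200000, 0xF0, 3),    # 4 bytes, 21 payload bits
        (0x4000000, 0xF8, 4),   # 5 bytes, 26 payload bits
        (0x80000000, 0xFC, 5),  # 6 bytes, 31 payload bits
    ]
    for limit, lead, n in forms:
        if cp < limit:
            # Lead byte carries the marker plus the highest payload bits.
            out = [lead | (cp >> (6 * n))]
            # Each continuation byte is 10xxxxxx with the next 6 bits.
            for i in range(n - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: range was checked above")


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10FFFF, 0x110000, 0x7FFFFFFF):
        print(f"U+{cp:X} -> {encode_utf8_pre2003(cp).hex(' ')}")
```

For codepoints up to U+10FFFF the output matches today's UTF-8; anything above that simply spills into the longer 4-, 5- and 6-byte forms.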