| HN Mirror

by vorg 4050 days ago

> The only acceptable solution is to remove the restriction of using only four bytes per sequence, which would allow encoding these easily while keeping all the advantages of UTF-8

I agree. Extending UTF-8 with surrogates like this is intended to be temporary, used only until the Unicode Consortium reinstates the pre-2003 limit of 2.1 billion codepoints for UTF-8 and UTF-32. Then any software using UTF-88 can easily switch to the 1 to 6-byte sequences of "reinstated" UTF-8. This surrogation scheme is actually intended for UTF-16, as a second-tier surrogate scheme that lets it encode the same number of codepoints as UTF-8 and UTF-32. I wrote all this under "Rationale" at the bottom of the linked page; did you read that far?

Hopefully, though, UTF-16 will be on its way out by the time pre-2003 UTF-8 and UTF-32 are reinstated, so this surrogation scheme won't even see much use there.
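
For readers unfamiliar with the pre-2003 layout mentioned above: RFC 2279 defined UTF-8 sequences of 1 to 6 bytes covering 0 through 0x7FFFFFFF (2^31, about 2.1 billion, code points). The following is a minimal Go sketch of that layout, not code from the UTF-88 library. The function name `encodeLegacyUTF8` is made up for illustration, and the sketch does not reject the UTF-16 surrogate range U+D800..U+DFFF.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeLegacyUTF8 (hypothetical name) encodes cp with the 1 to 6-byte
// layout of RFC 2279, which covers 0 through 0x7FFFFFFF.
func encodeLegacyUTF8(cp uint32) ([]byte, error) {
	if cp > 0x7FFFFFFF {
		return nil, errors.New("code point needs more than 31 bits")
	}
	if cp < 0x80 {
		return []byte{byte(cp)}, nil // plain ASCII, one byte
	}
	// Exclusive upper bound and lead-byte prefix for lengths 2 through 6.
	limits := []uint32{0x800, 0x10000, 0x200000, 0x4000000, 0x80000000}
	leads := []byte{0xC0, 0xE0, 0xF0, 0xF8, 0xFC}
	for i, limit := range limits {
		if cp < limit {
			n := i + 2
			out := make([]byte, n)
			// Fill continuation bytes (10xxxxxx) from the end, 6 bits each.
			for j := n - 1; j > 0; j-- {
				out[j] = 0x80 | byte(cp&0x3F)
				cp >>= 6
			}
			// Whatever bits remain go into the lead byte.
			out[0] = leads[i] | byte(cp)
			return out, nil
		}
	}
	return nil, errors.New("unreachable")
}

func main() {
	for _, cp := range []uint32{0x20AC, 0x10FFFF, 0x110000, 0x7FFFFFFF} {
		b, _ := encodeLegacyUTF8(cp)
		fmt.Printf("%#x -> % X\n", cp, b)
	}
}
```

The third value, 0x110000, is the first code point past today's Unicode limit; it fits in 4 bytes (F4 90 80 80) but is rejected by current UTF-8 decoders, and the 5 and 6-byte forms are only reached from 0x200000 upward.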

jekub 4050 days ago

But "temporary" is a thing that exists only in theory. In practice it always means either never or (almost) forever. As soon as a few applications start using this "new" form of UTF-8, some of them may have to keep supporting it forever.

Why not go directly for the pre-2003 UTF-8 encoding? It would even put a bit of pressure on restoring it and would show that this is the right way. It is also, I think, the only way to convince people to start implementing it.

vorg 4050 days ago

> As soon as a few applications start using this "new" form of UTF-8, some of them may have to keep supporting it forever

Not if it's used through a 3rd-party library such as the Go implementation of UTF-88 I've provided.

> Why not go directly for the pre-2003 UTF-8 encoding? It would even put a bit of pressure on restoring it

Because it's not a valid encoding under the current scheme, whereas UTF-8 with surrogates is, since it uses the 2 private use planes to implement the surrogates. The goal is restoration by the Unicode Consortium, but based on their public utterances it's not going to happen easily or quickly, and in the meantime we need an encoding that's valid under the current scheme because it may need to be used for 10 or 20 years. Of course I could have used UTF-16 with a doubly-directed surrogate system, but that would be even more error-prone, and I expect whatever 2nd-level surrogate system is eventually provided with UTF-16 will be legally available with UTF-8 and UTF-16 anyway.

UTF-88 is an attempt to showcase both a surrogation scheme implementable in current UTF-16 and the fact that UTF-8 is the best encoding.