| HN Mirror


	by yuubi 2906 days ago
	UTF-16 has a limit on the size of a code point because a code point is either a single normal code unit or a pair of surrogate code units, each encoding 10 bits of the code point (I think I used the right terminology). UTF-8 has a natural extension path to up to 7-byte encodings with all the usual UTF-8 properties (first code unit indicates how many remain, other code units are recognizable as not the first).

mehrdadn 2906 days ago

Where are you getting this information though? I haven't worked out the bits myself yet but Wikipedia's first sentence itself says UTF-16 can encode all 1,112,064 valid code points of Unicode, which is already more than 2^(10+10) = 1,048,576.
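The figures quoted in this comment can be checked with a little arithmetic. The sketch below is only an illustration; the variable names are made up here. It shows that the 1,112,064 total is the sum of the code points that fit in a single 16-bit unit and the 2^20 that surrogate pairs can reach.

```python
# Code point slots in the 17 planes, U+0000..U+10FFFF.
TOTAL_SLOTS = 0x10FFFF + 1                     # 1,114,112

# U+D800..U+DFFF are reserved for surrogates and are not valid code points.
SURROGATES = 0xDFFF - 0xD800 + 1               # 2,048

valid_code_points = TOTAL_SLOTS - SURROGATES   # 1,112,064

# What UTF-16 covers: single units in the first plane, plus pairs.
single_unit = 0x10000 - SURROGATES             # 63,488
via_pairs = 2 ** (10 + 10)                     # 1,048,576

assert single_unit + via_pairs == valid_code_points
print(valid_code_points, single_unit, via_pairs)
```

So the total exceeds 2^20 only because of the code points that never need a surrogate pair at all.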

JdeBP 2905 days ago

Unicode code point space: Was 16-bit (0000 to FFFF), then 31-bit (00000000 to 7FFFFFFF), and is now 21-bit (00000000 to 0010FFFF).

UTF-16: Encodes the entire 21-bit range, encoding most of the first 0000 to FFFF range as-is, and using surrogate pairs in that range to encode 00010000 to 0010FFFF. The latter range is shifted to 00000000 to 000FFFFF before encoding, which can be encoded in the 20 bits that surrogate pairs provide. This is a subtlety that one likely does not appreciate if one learns UTF-8 first and expects UTF-16 to be like it.

UTF-8: Could originally encode 00000000 to 7FFFFFFF, but since the limitation to just the first 17 planes a lot of UTF-8 codecs in the real world actually no longer contain the code for handling the longer sequences. Witness things like the UTF-8 codec in MySQL, whose 32-bit support conditional compilation switch is mentioned at https://news.ycombinator.com/item?id=17311048 .
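The UTF-16 shift described above can be sketched in a few lines. This is a minimal illustration, not any particular codec's implementation; `to_surrogate_pair` is a hypothetical helper name, and the cross-check at the end relies on Python's built-in `utf-16-be` encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # shifted into 0x00000..0xFFFFF (20 bits)
    high = 0xD800 | (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Cross-check against Python's own encoder (big-endian, no byte order mark).
expected = chr(0x1F600).encode("utf-16-be")
assert high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected
```

The subtraction of 0x10000 is the step that lets 20 bits of pair payload cover the range up to 0010FFFF.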

amluto 2905 days ago

> a lot of UTF-8 codecs in the real world actually no longer contain the code for handling the longer sequences.

Not exactly. A conforming decoder MUST reject them.

MySQL’s problem is that, by default, it can’t even handle all valid code points.
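The "must reject" behaviour is easy to observe with a strict decoder. The sketch below uses Python's built-in UTF-8 codec as one example of such a decoder; `is_rejected` is a hypothetical helper, and the byte sequences were built by hand for illustration.

```python
def is_rejected(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the input."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Old-style 5-byte sequence (lead byte 111110xx), here the value 0x200000.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# Well-formed 4-byte layout, but the value would be 0x110000, one past the limit.
beyond_max = bytes([0xF4, 0x90, 0x80, 0x80])

# The largest value still allowed, U+10FFFF.
largest_valid = bytes([0xF4, 0x8F, 0xBF, 0xBF])

assert is_rejected(five_byte)
assert is_rejected(beyond_max)
assert not is_rejected(largest_valid)
```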

JdeBP 2905 days ago

They reject them by not having a code path that successfully decodes them.

mehrdadn 2905 days ago

I don't see anything wrong with what you're saying, but I still don't get how it explains the original comment I replied to [1]:

> I'm not at all convinced that 2^21 codepoints will be enough, so someday it'd be nice to be able to get past UTF-16 and move to UTF-8

UTF-16 currently uses up to 2 16-bit code units per code point, whereas UTF-8 uses up to 4 8-bit code units per code point, and the latter wastes more bits for continuation than the former. How is "getting past UTF-16 and moving to UTF-8" supposed to increase the number of code points we can represent, as claimed above? If anything, UTF-16 wastes fewer bits in the current maximum number of code units, so it should have more room for expansion without increasing the number of code units.

[1] https://news.ycombinator.com/item?id=17771351

JdeBP 2905 days ago

If I had intended to explain that, I wouldn't have replied to your comment about not having worked out the bits with an explanation that works them out for you, and that shows UTF-16 really is capable of encoding all 17 planes even though surrogate pairs have only 20 bits.

And as you can see, if you do work out the bits, you find that cryptonector is wrong, since UTF-8 (as it has been standardized from almost the start of the 21st century, and as codecs in the real world have taken to implementing it since) encodes no more bits than UTF-16 does. It's 21 bits for both.
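As a quick check of the "same range for both" point under the current standard, the sketch below encodes the highest code point, U+10FFFF, both ways using Python's built-in codecs. It says nothing about what either format could be extended to; it only shows where both stop today.

```python
top = chr(0x10FFFF)                 # highest code point Unicode currently allows

utf8 = top.encode("utf-8")          # four 8-bit code units
utf16 = top.encode("utf-16-be")     # two 16-bit code units (a surrogate pair)

assert len(utf8) == 4
assert len(utf16) == 4              # 2 units x 2 bytes each

# Python's own notion of a code point ends at the same place.
try:
    chr(0x110000)
    accepted = True
except ValueError:
    accepted = False
assert not accepted
```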

cryptonector 2905 days ago

No, UTF-8 could, and used to, encode much more than 21 bits of codepoint space. It has been artificially limited to better match UTF-16 -- UTF-16's limits are not artificial but fundamental. If some day we need more bits, we'll simply obsolete UTF-16 and drop those limits on UTF-8. MSFT seems to be taking steps to put UTF-8 on a level or even higher playing field than UTF-16. We should welcome this.

mehrdadn 2905 days ago

It seemed like you were trying to correct my comment, but everything you said seemed to support what I was saying, so I thought maybe you were trying to continue the initial discussion... I guess not.

With regard to the comment then: the range downshifting you mentioned is merely a step in the encoding process though -- the code point is still whatever it was. If you read the parent comment, it claimed that, in a surrogate pair, each of the 2 code units encodes 10 bits of the code point... but that would be missing 1 bit, since the code points need 21 bits to be represented. That's all I was saying there. The extra bit indicating that it's in fact a surrogate pair isn't some kind of implicit dummy bit that you can pretend isn't encoding anything -- if it weren't there, then clearly it wouldn't be encoding the code point for a surrogate pair anymore.
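One way to see where the extra bit comes from is to decode a pair by hand. In the sketch below (`from_surrogate_pair` is a hypothetical helper written for this illustration), the two units contribute exactly 20 bits, and the fixed offset added for "this was a pair" is what pushes the result into 21-bit territory.

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    payload = ((high - 0xD800) << 10) | (low - 0xDC00)   # 10 + 10 explicit bits
    return payload + 0x10000                             # offset implied by the pair


largest_payload = ((0xDBFF - 0xD800) << 10) | (0xDFFF - 0xDC00)
largest_code_point = from_surrogate_pair(0xDBFF, 0xDFFF)

assert largest_payload.bit_length() == 20       # what the two units carry
assert largest_code_point.bit_length() == 21    # what the code point needs
assert largest_code_point == 0x10FFFF
```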

jwilk 2905 days ago

Yes, UTF-16 can encode all currently valid Unicode codepoints, which is more than 2²⁰ but less than 2²¹. But cryptonector doesn't believe it will be enough in the future.

OTOH, UTF-8, as originally defined, can encode 2³¹ codepoints.
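The 2³¹ figure follows from the bit layout of the original scheme: a lead byte made of as many 1 bits as there are bytes, then a 0, then payload, with each following byte carrying 6 payload bits. The sketch below counts those bits; `payload_bits` is a hypothetical helper, and the 7-byte row is the speculative extension mentioned at the top of the thread, not something any standard defines.

```python
def payload_bits(length: int) -> int:
    """Payload bits in a UTF-8-style sequence of the given byte length."""
    if length == 1:
        return 7                          # 0xxxxxxx
    if not 2 <= length <= 7:
        raise ValueError("lengths 1 to 7 only")
    lead_bits = 7 - length                # lead byte: `length` ones, one zero, rest payload
    return lead_bits + 6 * (length - 1)   # each continuation byte is 10xxxxxx


# Lengths 1 to 6 are the originally defined forms.
original = {n: payload_bits(n) for n in range(1, 7)}
print(original)

# Hypothetical 7-byte form (lead byte 0xFE), for comparison only.
print(payload_bits(7))
```

Under this count, the 4-byte form that the current standard stops at carries 21 bits and the original 6-byte form carries 31.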
