by JdeBP 2864 days ago
If I had intended to explain that, I wouldn't be replying to your comment about not working out the bits with an explanation that works out the bits for you, and that shows UTF-16 really is capable of encoding all 17 planes even though surrogate pairs have only 20 bits.

And as you can see, if you do work out the bits, you find that cryptonector is wrong: UTF-8 (as it has been standardized since almost the start of the 21st century, and as real-world codecs have implemented it since) encodes no more bits than UTF-16 does. It's 21 bits for both.
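
For readers who want to work out the bits themselves, here is a minimal sketch of the UTF-16 surrogate-pair arithmetic. The function names `to_surrogates` and `from_surrogates` are made up for illustration and are not from any library.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000               # downshift: now fits in 20 bits
    high = 0xD800 | (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))
```

With this sketch, U+10FFFF (the last code point of plane 16) maps to the pair 0xDBFF, 0xDFFF, which is the largest pair the scheme can express.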

2 comments

No, UTF-8 could, and used to, encode much more than 21 bits of codepoint space. It has been artificially limited to better match UTF-16 -- UTF-16's limits are not artificial but fundamental. If some day we need more bits, we'll simply obsolete UTF-16 and drop those limits on UTF-8. MSFT seems to be taking steps to put UTF-8 on an equal or even better footing than UTF-16. We should welcome this.
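
A rough sketch of the payload arithmetic behind that claim, assuming the original UTF-8 design (RFC 2279), which allowed sequences of up to 6 bytes, and the current definition (RFC 3629), which stops at 4 bytes and caps code points at U+10FFFF. The helper name `utf8_payload_bits` is hypothetical.

```python
def utf8_payload_bits(n: int) -> int:
    """Code-point bits carried by an n-byte sequence in the original 1..6-byte design."""
    if n == 1:
        return 7                      # 0xxxxxxx
    if not 2 <= n <= 6:
        raise ValueError("sequence length must be 1..6")
    lead_bits = 7 - n                 # lead byte: n one-bits, a zero, then payload
    continuation_bits = 6 * (n - 1)   # each 10xxxxxx byte carries 6 bits
    return lead_bits + continuation_bits


# Payload sizes for sequence lengths 1 through 6.
PAYLOAD_BITS = [utf8_payload_bits(n) for n in range(1, 7)]
```

Under these assumptions a 4-byte sequence carries 21 bits and a 6-byte sequence carries 31, so both comments are describing real properties: the 21-bit figure is the standardized limit, and the larger figure is what the byte layout itself allowed.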
It seemed like you were trying to correct my comment, but everything you said seemed to support what I was saying, so I thought maybe you were trying to continue the initial discussion... I guess not.

With regard to the comment, then: the range downshifting you mentioned is merely a step in the encoding process, though -- the code point is still whatever it was. If you read the parent comment, it had claimed that, in a surrogate pair, each of the 2 code units encodes 10 bits of the code point... but that would be missing 1 bit when the code points need 21 bits to be represented. That's all I was saying there. The extra bit indicating that it's in fact a surrogate pair isn't some kind of implicit dummy bit that you can pretend isn't encoding anything -- if it weren't there then clearly it wouldn't be encoding the code point for a surrogate pair anymore.
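
The two bit counts being argued over can be checked with plain integer arithmetic. This is only an illustrative sketch; the constant names are made up here.

```python
MAX_CODE_POINT = 0x10FFFF        # last code point of plane 16
SUPPLEMENTARY_BASE = 0x10000     # first code point outside the BMP

# Width of the largest code point itself.
bits_for_code_point = MAX_CODE_POINT.bit_length()

# Width of the largest value left after the downshift by 0x10000.
bits_after_downshift = (MAX_CODE_POINT - SUPPLEMENTARY_BASE).bit_length()

# Code points reachable only through surrogate pairs, and the overall total.
supplementary_count = MAX_CODE_POINT - SUPPLEMENTARY_BASE + 1
total_count = SUPPLEMENTARY_BASE + supplementary_count
planes = total_count // 0x10000
```

The code point needs 21 bits, the downshifted offset needs 20, and the missing information is carried by the fact that a surrogate pair is present at all: the 65,536 BMP values plus the 2^20 pair values add up to 17 planes.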