by zadjii 2865 days ago
It is! But this is a very un-Windows-like feature, isn't it? We want this to work on other platforms with as little modification as necessary, and frankly, jumping through the wchar_t<->char hoops is a _pain_. So we'll do it for you!

Hear hear! wchar_t is a disaster. UTF-16 is terrible. I'm not at all convinced that 2^21 codepoints will be enough, so someday it'd be nice to be able to get past UTF-16 and move to UTF-8, and Windows and ECMAScript are the biggest impediments to that. Your choice of UTF-8 will tend to place UTF-8 on a level playing field in Win32.

I guess, too, that this is the end of codepages -- I doubt they'd go away, but there should be no more need to struggle with them, just use UTF-8. You'll still need a semblance of locale, for localization purposes, naturally, but all-UTF-8-all-the-time is a great simplification.

Confused, where does 2^21 code points come from and how is that related to the UTF-16 vs. UTF-8 distinction? Can't both of them encode all Unicode code points? Or are you thinking of code units perhaps, and UCS-2? Although even there I'm confused where the 2^21 came from.
UTF-16 has a limit on the size of a code point because a code point is either a single normal code unit or a pair of surrogate code units, each encoding 10 bits of the code point (I think I used the right terminology). UTF-8 has a natural extension path to up to 7-byte encodings with all the usual UTF-8 properties (first code unit indicates how many remain, other code units are recognizable as not the first).
Where are you getting this information though? I haven't worked out the bits myself yet but Wikipedia's first sentence itself says UTF-16 can encode all 1,112,064 valid code points of Unicode, which is already more than 2^(10+10) = 1,048,576.
Unicode code point space: Was 16-bit (0000 to FFFF), then 31-bit (00000000 to 7FFFFFFF), and is now 21-bit (00000000 to 0010FFFF)

UTF-16: Encodes the entire 21-bit range, encoding most of the first 0000 to FFFF range as-is, and using surrogate pairs in that range to encode 00010000 to 0010FFFF. The latter range is shifted to 00000000 to 000FFFFF before encoding, which can be encoded in the 20 bits that surrogate pairs provide. This is a subtlety that one likely does not appreciate if one learns UTF-8 first and expects UTF-16 to be like it.
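
A minimal Python sketch of the surrogate-pair arithmetic described above. The function name `utf16_encode_code_point` is made up for illustration; it only shows the shift by 0x10000 and the 10+10 bit split, not a full codec.

```python
def utf16_encode_code_point(cp: int) -> list:
    """Return the UTF-16 code units (as ints) for one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the 17 planes UTF-16 can reach")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not encodable code points")
    if cp < 0x10000:
        # Basic Multilingual Plane: stored as-is in one code unit.
        return [cp]
    # Shift 0x10000..0x10FFFF down to 0x00000..0xFFFFF (20 bits).
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]
```

With this sketch, U+1F600 becomes the pair 0xD83D, 0xDE00, and the highest code point U+10FFFF becomes 0xDBFF, 0xDFFF, which is why the 20 bits in a pair are exactly enough for the 16 supplementary planes.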

UTF-8: Could originally encode 00000000 to 7FFFFFFF, but since the limitation to just the first 17 planes a lot of UTF-8 codecs in the real world actually no longer contain the code for handling the longer sequences. Witness things like the UTF-8 codec in MySQL, whose 32-bit support conditional compilation switch is mentioned at https://news.ycombinator.com/item?id=17311048 .
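
To make the "longer sequences" concrete, here is a rough Python sketch of the original UTF-8 scheme with sequences of up to 6 bytes covering 00000000 to 7FFFFFFF. The name `utf8_encode_original` is hypothetical, and the sketch deliberately ignores the modern restrictions (no surrogate check, no U+10FFFF ceiling).

```python
def utf8_encode_original(cp: int) -> bytes:
    """Encode one code point using the original 1- to 6-byte UTF-8 layout."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("the original scheme covers 31 bits only")
    if cp < 0x80:
        return bytes([cp])  # ASCII is a single byte
    # (exclusive upper limit, sequence length, lead-byte marker)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in layouts:
        if cp < limit:
            break
    out = []
    for _ in range(length - 1):
        # Each continuation byte carries 6 payload bits: 10xxxxxx.
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    out.append(lead | cp)  # the lead byte announces the sequence length
    return bytes(reversed(out))
```

For code points up to U+10FFFF the output matches today's UTF-8 (except that surrogates are not refused); the 5- and 6-byte forms are the part that modern codecs leave out.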

> a lot of UTF-8 codecs in the real world actually no longer contain the code for handling the longer sequences.

Not exactly. A conforming decoder MUST reject them.

MySQL’s problem is that, by default, it can’t even handle all valid code points.
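
The rejection rule can be observed with Python's built-in strict UTF-8 decoder, used here purely as an example of a decoder that refuses such input. The helper name `is_valid_utf8` is an assumption for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


highest_valid = b"\xf4\x8f\xbf\xbf"       # U+10FFFF, the last code point
just_past_end = b"\xf4\x90\x80\x80"       # would be U+110000
five_byte_form = b"\xf8\x88\x80\x80\x80"  # old-style 5-byte sequence
encoded_surrogate = b"\xed\xa0\x80"       # U+D800 written as 3 bytes
```

Only `highest_valid` is accepted; the other three are refused rather than decoded.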

I don't see anything wrong with what you're saying, but I still don't get how it explains the original comment I replied to [1]:

> I'm not at all convinced that 2^21 codepoints will be enough, so someday it'd be nice to be able to get past UTF-16 and move to UTF-8

UTF-16 currently uses up to 2 16-bit code units per code point, whereas UTF-8 uses up to 4 8-bit code units per code point, and the latter wastes more bits for continuation than the former. How is "getting past UTF-16 and moving to UTF-8" supposed to increase the number of code points we can represent, as claimed above? If anything, UTF-16 wastes fewer bits in the current maximum number of code units, so it should have more room for expansion without increasing the number of code units.

[1] https://news.ycombinator.com/item?id=17771351

Yes, UTF-16 can encode all currently valid Unicode codepoints, which is more than 2²⁰ but less than 2²¹. But cryptonector doesn't believe it will be enough in the future.

OTOH, UTF-8, as originally defined, can encode 2³¹ codepoints.
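
For anyone who wants to check the numbers in this subthread, a short Python calculation (the constant names are just labels chosen here):

```python
# 17 planes of 65,536 code points each: U+0000 through U+10FFFF.
CODE_SPACE = 0x10FFFF + 1

# U+D800 through U+DFFF are reserved for UTF-16 surrogates.
SURROGATES = 0xDFFF - 0xD800 + 1

# What is left is the set of valid code points every UTF must encode.
VALID_CODE_POINTS = CODE_SPACE - SURROGATES

# Size of the code space the original UTF-8 design could address.
ORIGINAL_UTF8_SPACE = 2 ** 31
```

`VALID_CODE_POINTS` comes out to 1,112,064, which sits between 2^20 (1,048,576) and 2^21 (2,097,152), while `ORIGINAL_UTF8_SPACE` is 2,147,483,648.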

UTF-16 is always 16 bit, UTF-8 is variable, can go from 8 to 32 bits as needed...

This does make coding for utf-8 harder, but when it works is really wonderful stuff.

UTF-16 is not always 16b/codepoint. UCS-2, its ancestor and more or less what is actually used by Windows when it talks about "Unicode", was 16b/codepoint. It turned out that 64k codepoints wasn't enough for everybody, so they increased the range and added "surrogate pairs" to UCS-2 to make UTF-16. Codepoints in UTF-16 are, therefore, either 16b/codepoint or 32b/codepoint. Many systems that adopted UCS-2 claim UTF-16 compatibility but in fact allow unpaired surrogates (an error in UTF-16). The necessity of encoding such invalid UTF-16 strings in UTF-8 under certain circumstances led to the pseudo-standard known as WTF-8[0].
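
A small Python illustration of the unpaired-surrogate problem. Python strings happen to be able to hold a lone surrogate, which makes them convenient for a demo; the helper name `encodes_strictly` is made up, and `surrogatepass` is Python's own error handler, not WTF-8 itself.

```python
def encodes_strictly(text: str, encoding: str) -> bool:
    """True if the text can be encoded with the default strict error handling."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


lone_high = "\ud83d"            # a high surrogate with no low surrogate after it
proper_pair = "\U0001F600"      # the code point a complete pair would represent

# The strict codecs refuse the lone surrogate. Python's "surrogatepass"
# handler writes it out anyway: as a bare 16-bit unit in UTF-16, and as a
# 3-byte sequence in UTF-8 that strict UTF-8 decoders will not accept.
as_utf16 = lone_high.encode("utf-16-le", "surrogatepass")
as_utf8_like = lone_high.encode("utf-8", "surrogatepass")
```

Here `encodes_strictly(lone_high, "utf-16-le")` and `encodes_strictly(lone_high, "utf-8")` are both false, while the forced encodings are `b"\x3d\xd8"` and `b"\xed\xa0\xbd"` respectively.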

UTF-8 was designed (as legend has it, on the back of a napkin during a dinner break) after the increase in range and doesn't suffer from the same problem. Additionally, its straightforward "continuation" mechanism isn't any more difficult to deal with than surrogate pairs, and it doesn't have any endianness issues like UTF-16/UCS-2.

[0]https://simonsapin.github.io/wtf-8/
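
The two UTF-8 properties mentioned in this comment (recognizable continuation bytes, no byte-order variants) can be sketched in Python. The function names are hypothetical, and the sample string is arbitrary.

```python
def is_continuation(byte: int) -> bool:
    """Continuation bytes always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def code_point_starts(data: bytes) -> list:
    """Byte offsets where a code point begins in well-formed UTF-8."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


sample = "a\u00e9\U0001F600"  # 1-byte, 2-byte and 4-byte code points

utf8_bytes = sample.encode("utf-8")

# UTF-16 has two serializations that differ only in byte order;
# UTF-8 is a byte sequence, so there is nothing to swap.
utf16_le = sample.encode("utf-16-le")
utf16_be = sample.encode("utf-16-be")
```

`code_point_starts(utf8_bytes)` gives `[0, 1, 3]`, found by looking at each byte in isolation, while `utf16_le` and `utf16_be` are different byte strings for the same text.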

UCS-2 is always 2 octets; UTF-16 allows for so-called surrogate pairs, which expand the bits available and make it variable-size.
Look up UCS-2 that I mentioned in my comment.