For me, the most surprising thing is that the new PTY devices use UTF-8. Not UTF-16 or UCS-2 or weird little endian variants thereof, and not even wchar_t.
It is! But this is a very un-Windows-like feature, isn't it? We want this to work on other platforms with as little modification as necessary, and frankly, jumping through the wchar_t<->char hoops is a _pain_. So we'll do it for you!
Hear hear! wchar_t is a disaster. UTF-16 is terrible. I'm not at all convinced that 2^21 codepoints will be enough, so someday it'd be nice to be able to get past UTF-16 and move to UTF-8, and Windows and ECMAScript are the biggest impediments to that. Your choice of UTF-8 will tend to place UTF-8 on a level playing field in Win32.
I guess, too, that this is the end of codepages -- I doubt they'd go away, but there should be no more need to struggle with them, just use UTF-8. You'll still need a semblance of locale, for localization purposes, naturally, but all-UTF-8-all-the-time is a great simplification.
Confused, where does 2^21 code points come from and how is that related to the UTF-16 vs. UTF-8 distinction? Can't both of them encode all Unicode code points? Or are you thinking of code units perhaps, and UCS-2? Although even there I'm confused where the 2^21 came from.
UTF-16 has a limit on the size of a code point because a code point is either a single normal code unit or a pair of surrogate code units, each encoding 10 bits of the code point (I think I used the right terminology). UTF-8 has a natural extension path to up to 7-byte encodings with all the usual UTF-8 properties (first code unit indicates how many remain, other code units are recognizable as not the first).
Where are you getting this information though? I haven't worked out the bits myself yet but Wikipedia's first sentence itself says UTF-16 can encode all 1,112,064 valid code points of Unicode, which is already more than 2^(10+10) = 1,048,576.
Unicode code point space: Was 16-bit (0000 to FFFF), then 32-bit (00000000 to FFFFFFFF), and is now 21-bit (00000000 to 0010FFFF)
UTF-16: Encodes the entire 21-bit range, encoding most of the first 0000 to FFFF range as-is, and using surrogate pairs in that range to encode 00010000 to 0010FFFF. The latter range is shifted to 00000000 to 000FFFFF before encoding, which can be encoded in the 20 bits that surrogate pairs provide. This is a subtlety that one likely does not appreciate if one learns UTF-8 first and expects UTF-16 to be like it.
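The shift-then-split step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `to_surrogate_pair` is made up for this example, and the result is cross-checked against Python's built-in UTF-16 codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only 0x10000..0x10FFFF is encoded with surrogates")
    offset = code_point - 0x10000       # shifted into 0x00000..0xFFFFF (20 bits)
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F4A9)  # U+1F4A9 PILE OF POO
print(hex(high), hex(low))

# Cross-check against the built-in codec (big-endian, no BOM).
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert chr(0x1F4A9).encode("utf-16-be") == expected
```

The subtraction of 0x10000 is the part that lets 20 bits of surrogate payload cover 16 full planes on top of the BMP.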
UTF-8: Could originally encode 00000000 to 7FFFFFFF, but since the limitation to just the first 17 planes a lot of UTF-8 codecs in the real world actually no longer contain the code for handling the longer sequences. Witness things like the UTF-8 codec in MySQL, whose 32-bit support conditional compilation switch is mentioned at https://news.ycombinator.com/item?id=17311048 .
Yes, UTF-16 can encode all currently valid Unicode codepoints, which is more than 2²⁰ but less than 2²¹. But cryptonector doesn't believe it will be enough in the future.
OTOH, UTF-8, as originally defined, can encode 2³¹ codepoints.
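For anyone wondering where 2³¹ comes from, here is a rough sketch of the payload arithmetic for the original one-to-six-byte scheme. The helper names are invented for this example, and this only counts bits; it is not an encoder.

```python
def payload_bits(sequence_length):
    """Bits of code point carried by an original-style UTF-8 sequence."""
    if sequence_length == 1:
        return 7                         # 0xxxxxxx
    # The lead byte spends (length + 1) bits on its prefix,
    # and each continuation byte (10xxxxxx) carries 6 bits.
    return (7 - sequence_length) + 6 * (sequence_length - 1)

def original_utf8_length(value):
    """Bytes needed for a value under the original 31-bit definition."""
    for length in range(1, 7):
        if value < 2 ** payload_bits(length):
            return length
    raise ValueError("value does not fit in 31 bits")

for n in range(1, 7):
    print(n, "bytes ->", payload_bits(n), "bits")

print(original_utf8_length(0x10FFFF))    # top of today's Unicode range
print(original_utf8_length(0x7FFFFFFF))  # top of the original range
```

Four bytes already give 21 bits, which is why the current 0 to 10FFFF range never needs the five- and six-byte forms.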
UTF-16 is not always 16b/codepoint. UCS-2, its ancestor and more or less what is actually used by Windows when it talks about "Unicode", was 16b/codepoint. It turned out that 64k codepoints wasn't enough for everybody, so they increased the range and added "surrogate pairs" to UCS-2 to make UTF-16. Codepoints in UTF-16 are, therefore, either 16b/codepoint or 32b/codepoint. Many systems that adopted UCS-2 claim UTF-16 compatibility but in fact allow unpaired surrogates (an error in UTF-16). The necessity of encoding such invalid UTF-16 strings in UTF-8 under certain circumstances led to the pseudo-standard known as WTF-8[0].
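A small Python sketch of the unpaired-surrogate problem. It assumes CPython's standard `surrogatepass` error handler and only covers a single lone surrogate; it is not an implementation of WTF-8, and the helper name `encode_strict` is invented here.

```python
LONE_SURROGATE = "\ud800"  # a high surrogate with no low surrogate after it

def encode_strict(text):
    """Return the UTF-8 bytes, or None if strict UTF-8 refuses the text."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None

strict_result = encode_strict(LONE_SURROGATE)
lenient_result = LONE_SURROGATE.encode("utf-8", "surrogatepass")

print(strict_result)    # None: strict UTF-8 has no encoding for a surrogate
print(lenient_result)   # the three-byte form a strict decoder will reject
```

As far as I understand it, the three-byte form is also how WTF-8 represents a lone surrogate, which is what lets such ill-formed UTF-16 round-trip through bytes.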
UTF-8 was designed (as legend has it, on the back of a napkin during a dinner break) after the increase in range and doesn't suffer from the same problem. Additionally, its straightforward "continuation" mechanism isn't any more difficult to deal with than surrogate pairs, and it doesn't have any endianness issues like UTF-16/UCS-2.
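To make the "continuation" mechanism and the endianness point concrete, here is a minimal Python sketch. The `classify` helper is hypothetical and does no validation; it only looks at the top bits of each byte.

```python
def classify(byte):
    """Label a UTF-8 byte by its leading bits (no validity checking)."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx starts a multi-byte sequence

text = "a\u00e9\u20ac"  # 'a', e with acute, euro sign: 1, 2 and 3 bytes
labels = [classify(b) for b in text.encode("utf-8")]
print(labels)

# UTF-8 is a byte stream, so there is only one byte order.
# UTF-16 code units are two bytes wide, so the order has to be agreed on.
euro_le = "\u20ac".encode("utf-16-le")
euro_be = "\u20ac".encode("utf-16-be")
print(euro_le, euro_be)
```

Because every byte announces its own role, a decoder dropped into the middle of a UTF-8 stream can skip continuation bytes until it finds the next lead or ASCII byte.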
UTF-16 is garbage. Windows is stuck with it because it was too early an adopter of Unicode. Oh the irony. This may set Windows on a path to deprecating UTF-16 -- godspeed!
Another exciting development in moving beyond UTF-16: Microsoft is experimenting with adding a native UTF-8 string in .NET next to the existing UTF-16 string:
I've just spent some time reading through the proposals, it made for a fascinating read! It's really interesting to see the work and discussions that go into a seemingly simple feature like this.
Could you elaborate? I've been under the guise for most of my career that doubling a digit leads to huge benefits that I'm too comp-sci ignorant to understand.
Can I assume that's just subtle humor on your part?
If not:
UTF-16 is born of UCS-2 being a very poor codeset, as it was limited to the Unicode BMP, which means 2^16 codepoints, but Unicode has many more codepoints, so users couldn't have the pile-of-poo emoticon. Something had to be done, and that something was to create a variable-length (in terms of 16-bit code units) encoding using a few then-unassigned codepoints in the BMP. The result yields only a sad, pathetic, measly 2^21 codepoints, and that's just not that much. Moreover, while many codesets play well with ASCII, UTF-16 doesn't. Also, decomposed forms of Unicode glyphs necessarily involve multiple codepoints, thus multiple code units... Many programmers hate variable length text encoding because they can't do simple array indexing operations to find the nth character in a string, but with UTF-8, UTF-16, and just plain decomposition, that's a fact of life anyways. If you're going to have a variable-length codeset encoding, you might as well use UTF-8 and get all its plays-nice-with-ASCII benefits. For Latin-mostly text UTF-8 also is more efficient than UTF-16, so there is a slight benefit there.
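The plays-nice-with-ASCII and size claims are easy to check. A quick Python sketch with arbitrary sample strings; `encoded_sizes` is a made-up helper.

```python
def encoded_sizes(text):
    """Byte counts for the same text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

ascii_text = "plain ASCII"
cjk_text = "\u65e5\u672c\u8a9e"  # three CJK characters from the BMP

# Pure ASCII text is byte-for-byte identical in UTF-8.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(encoded_sizes(ascii_text))  # UTF-16 doubles the size here
print(encoded_sizes(cjk_text))    # here UTF-8 is the larger one
```

So the size benefit really is limited to Latin-mostly text; for BMP CJK text UTF-16 is the more compact of the two.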
Much of the rest of the non-Windows, non-ECMAScript world has settled on UTF-8, and that's a very very good thing.
No humor whatsoever. Thank you for the explanation! I'm an ops person who knows python/golang to a dangerous extent and have never gone out of my way to understand the UTF reasonings. Your post intrigued me and made me want to ask why you felt that way. This will make me sound horrendously ignorant to someone like yourself but I'm here to learn.
UTF-16 has a bit of a funky design (using four byte/two code unit surrogate pairs to encode code points outside the basic multilingual plane) that ultimately restricts Unicode (if compatibility is to be maintained with UTF-16, at least) to 17 planes, or 2^20 code points (about 1 million).
UTF-8 uses a variable length encoding that allows for more characters-- if restricted to four bytes, it allows for 2^21 total code points; it's designed to eventually allow for 2^31 code points, which works out to about 2 billion code points that can be expressed.
(Granted, this is all hypothetical-- Unicode isn't even close to filling all of the space that UTF-16 allows; there aren't enough known writing systems yet to be encoded to fill all of the remaining Unicode planes (3-13 of 17 are all still unassigned). But UTF-16's still nonstandard (most of the world's standardized on UTF-8) and kind of ugly, so the sooner it goes away, the better.)
That is a bit misleading to the point of error, on several points:
* Your timeline is backwards. UTF-8 was designed for a 31-bit code space. Far from that being its future, that is its past. In the 21st century it was explicitly reduced from 31-bit capable to 21 bits.
* UTF-16 is just as standard as UTF-8 is, it being standardized by the same people in the same places.
* 17 planes is 21 bits; it is 16 planes that is 20 bits.
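The plane arithmetic in that last point, and the 1,112,064 figure quoted from Wikipedia earlier in the thread, can be checked in a few lines of Python. The constant names are just for illustration.

```python
PLANE_SIZE = 0x10000                     # 65,536 code points per plane

code_space = 17 * PLANE_SIZE             # 0x000000 .. 0x10FFFF
surrogates = 0xDFFF - 0xD800 + 1         # reserved for UTF-16, not characters
encodable = code_space - surrogates      # what UTF-16 can actually represent

print(code_space, surrogates, encodable)

# Bits needed to write the highest value of a 16-plane and a 17-plane space.
print((16 * PLANE_SIZE - 1).bit_length())
print((17 * PLANE_SIZE - 1).bit_length())

assert 2 ** 20 < encodable < 2 ** 21
```

Note that 17 planes only use a small part of the 21-bit range: 21 bits could address 32 planes.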
Joel Spolsky's essay "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is an excellent read:
Wikipedia says: "UTF-8 requires 8, 16, 24 or 32 bits (one to four octets) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character"
> I've been under the guise for most of my career that doubling a digit leads to huge benefits that I'm too comp-sci ignorant to understand.
I was confused about this for years, too. But it turns out it's just a problem of bad naming. Happens more in this industry than we'd like to admit.
As others explained, it boils down to UTF-16 being 16-bit, and UTF-8 being anything from 8- to 32-bit. It should have been named UTF-V (from "variable") or something, but here we are.
UTF-16 is a variable-length encoding using up to two code units, each of which is 16 bits wide.
UTF-8 is a variable-length encoding using up to 4 code units (though it used to be up to 6, and could again be up to 6), each of which is 8 bits wide.
Both UTF-16 and UTF-8 are variable-length encodings!
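A short Python sketch that counts code units per code point in each encoding, using Python's built-in codecs. The sample characters are arbitrary and `code_units` is a hypothetical helper.

```python
def code_units(ch):
    """Code units needed for one code point: (UTF-8, UTF-16, UTF-32)."""
    return (
        len(ch.encode("utf-8")),            # 8-bit units
        len(ch.encode("utf-16-le")) // 2,   # 16-bit units
        len(ch.encode("utf-32-le")) // 4,   # 32-bit units
    )

samples = {
    "A": "A",
    "e-acute": "\u00e9",
    "euro": "\u20ac",
    "pile of poo": "\U0001f4a9",
}
for name, ch in samples.items():
    print(name, code_units(ch))
```

Only the UTF-32 column stays constant; UTF-8 varies from one to four units and UTF-16 from one to two.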
UTF-32 is not variable-length, but even so, the way Unicode works a character like á can be written in two different ways, one of which requires one codepoint and one of which requires two (regardless of encoding), while ṻ (LATIN SMALL LETTER U WITH MACRON AND DIAERESIS) can be written in up to five different ways requiring from one to three different codepoints (regardless of encoding).
Not every character has a one-codepoint representation in Unicode, or at least not every character has a canonically-pre-composed one-codepoint representation in Unicode.
Therefore, many characters in Unicode can be expected to be written in multiple codepoints regardless of encoding. Therefore all programmers dealing with text need to be prepared for being unable to do an O(1) array index operation to get at the nth character of a string.
(In UTF-32 you can do an O(1) array index operation to get to the nth codepoint, not character, but one is usually only ever interested in getting the nth character.)
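Python strings are indexed by codepoint, so the standard `unicodedata` module makes this easy to see. A minimal sketch; it only compares the fully composed (NFC) and fully decomposed (NFD) forms and does not enumerate every equivalent spelling.

```python
import unicodedata

def codepoint_counts(ch):
    """Codepoints in the composed (NFC) and decomposed (NFD) forms."""
    nfc = unicodedata.normalize("NFC", ch)
    nfd = unicodedata.normalize("NFD", ch)
    return len(nfc), len(nfd)

a_acute = "\u00e1"           # LATIN SMALL LETTER A WITH ACUTE
u_macron_diaer = "\u1e7b"    # LATIN SMALL LETTER U WITH MACRON AND DIAERESIS

print(codepoint_counts(a_acute))
print(codepoint_counts(u_macron_diaer))

# Indexing finds the nth codepoint, not the nth character:
decomposed = unicodedata.normalize("NFD", u_macron_diaer)
print(repr(decomposed[0]))   # just the base letter, accents stripped off
```

Getting at the nth user-perceived character needs grapheme cluster segmentation, which the Python standard library does not provide, so that part is left out here.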
For completeness' sake: the confusing part is that there is (used to be) a constant-length encoding that uses the exact same 16-bit code units as UTF-16, but doesn't allow the variable-length extensions. That encoding is called UCS-2 and, although deprecated, is the reason why many people think UTF-16 is constant-length.