by jeberle 285 days ago
UTF-16 arguably is Unicode 2.0+. It's how the code point address space is defined. Code points are either 1 or 2 16-bit code units. Easy. Compare w/ UTF-8 where a code point may be 1, 2, 3, or 4 8-bit code units.
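For anyone who wants to check the unit counts, here is a minimal Python sketch. The helper name `code_units` is made up for illustration, and it assumes Python 3.9+ for the type hints.

```python
def code_units(ch: str) -> tuple[int, int]:
    """Return (UTF-8 code units, UTF-16 code units) for a single code point."""
    utf8_units = len(ch.encode("utf-8"))             # 8-bit units
    utf16_units = len(ch.encode("utf-16-le")) // 2   # 16-bit units, no BOM
    return utf8_units, utf16_units

# One sample code point per UTF-8 length class.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the last sample (outside the BMP) needs two UTF-16 units; the four samples need 1, 2, 3, and 4 UTF-8 units respectively.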

UTF-16 is annoying, but it's far from the biggest design failure in Unicode.

4 comments

We can argue about "biggest" all day long but UTF-16 is a huge design failure because it made a huge chunk of the lower Unicode space unusable, thereby making better encodings like UTF-8 that could easily represent those code points less efficient. This layer-violating hack should have made it clear that UTF-16 was a bad idea from the start.
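To make the "unusable chunk" concrete, here is a small Python sketch (variable names are only illustrative). The UTF-8 bit layout has no trouble representing a surrogate code point such as U+D800, but a conforming encoder has to refuse it because that range is reserved for UTF-16.

```python
lone = "\ud800"  # a lone high surrogate code point

# A strict UTF-8 encoder rejects surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Python's "surrogatepass" error handler shows the bytes the UTF-8
# bit pattern would produce if the range were not reserved.
raw = lone.encode("utf-8", "surrogatepass")

print(strict_ok, raw)
```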

Then there is also the issue that technically there is no such thing as UTF-16, instead you need to distinguish UTF-16LE and UTF-16BE. Even though approximately no one uses the latter we still can't ignore it and have to prepend documents and strings with byte order marks (another wasted pair of code points for the sake of an encoding issue), which means you can't even trivially concatenate them anymore.
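A quick Python sketch of the concatenation problem. It relies on Python's `"utf-16"` codec, which prepends a BOM when encoding and strips only a leading BOM when decoding; other libraries may behave differently.

```python
import codecs

a = "foo".encode("utf-16")  # BOM + "foo" in native byte order
b = "bar".encode("utf-16")  # BOM + "bar"

# Byte-wise concatenation leaves the second BOM in the middle of the text,
# where it decodes as U+FEFF (ZERO WIDTH NO-BREAK SPACE).
joined = (a + b).decode("utf-16")
print(repr(joined))

# The same operation on UTF-8 bytes is harmless.
joined_utf8 = ("foo".encode("utf-8") + "bar".encode("utf-8")).decode("utf-8")
print(repr(joined_utf8))
```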

Meanwhile UTF-8 is backwards compatible with ASCII, byte order independent, has tons of useful properties and didn't require any Unicode code point assignments to achieve that.

The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly when it became clear that two bytes wasn't going to be enough. It's a dirty hack to cover up a mistake that should have never existed.

> The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly

That's a strange way to characterize having years of backwards compatibility to deal with.

https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...

There are many OS interfaces that were deprecated after five years or even longer. It's been multiple times those five years since then and we'll likely have to deal with UTF-16 for much longer still. Having to provide backwards compatibility for UTF-16 interfaces doesn't mean they had to keep these as the defaults or provide new UTF-16 interfaces. In particular Win32 already has 8-bit char interfaces that Microsoft could have easily added UTF-8 support to right then and re-blessed as the default. The decision not to do that was not a technical one but a political one.
This isn't "deprecate a few functions" -- it's basically an effort on par with migrating to Unicode in the first place.

I disagree you could just "easily" shove it into the "A" version of functions. Functions that accept UTF-8 could accept ASCII, but you can't just change the semantics of existing functions that emit text because it would blow up backwards compatibility. In a sense it is covariant but not contravariant.

And now, after you've gone through all of this effort: what was the actual payoff? And at what cost of maintaining compatibility with the other representations?

UTF-16 is the worst of all worlds. Either use UTF-32 where code points are fixed-size, or if you care about space efficiency use UTF-8.
UTF-32 is arguably even more the worst of all worlds. You don't get fixed-size units in any meaningful way. Yes, you have fixed-size code points, but those aren't the "units" you care about; you still have variable-size grapheme clusters, so you still can't do things like reversing a string or splitting a string at an arbitrary index or anything else like that. Yet it consumes twice the space of UTF-16 for almost everything, and four times the space of UTF-8 for many things.
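A minimal Python illustration of the grapheme cluster point. Python's `str` is indexed by code point, so it behaves like UTF-32 for this purpose; the sample string is just an example.

```python
# "café" written with a combining accent: 'e' followed by U+0301.
s = "cafe\u0301"

print(len(s))    # counts code points, not user-perceived characters

# Reversing code point by code point detaches the accent from its 'e'.
rev = s[::-1]
print([f"U+{ord(c):04X}" for c in rev])
```

The reversed string starts with the combining mark, so the accent no longer sits on the letter it belonged to, even though every unit is a whole code point.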

UTF-32 is the worst of all worlds. UTF-16 has the teeny tiny advantage that pure Chinese text takes a bit less space in UTF-16 than UTF-8 (typically irrelevant because that advantage is outweighed by the fact that the markup surrounding the text takes more space). UTF-8 is the best option for pretty much everything.
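The size claim is easy to check. A rough Python sketch follows; the helper `sizes` and the sample strings are made up for illustration, and the BOM-less codecs are used so only the payload is counted.

```python
def sizes(text: str) -> dict[str, int]:
    """Encoded size in bytes for each encoding form, without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

bare = "\u4e2d\u6587"               # two CJK characters
marked_up = "<p>\u4e2d\u6587</p>"   # the same text inside ASCII markup

print(sizes(bare))
print(sizes(marked_up))
```

For the bare text UTF-16 wins (4 bytes vs 6), but with even this little markup around it UTF-8 is already smaller (13 bytes vs 18).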

As a consequence, never use UTF-32, only use UTF-16 where necessary due to backwards compatibility, always use UTF-8 where possible.

In order to implement grapheme cluster segmentation, you have to start with a sequence of Unicode scalars. In practice, that means a sequence of 32-bit integers, which is UTF-32 in all but name. It's not a good interchange format, but it is a necessary intermediate/internal format.

There's also the problem that grapheme cluster boundaries change over time. Unicode has become a true mess.

Yeah, you need some kind of sequence of Unicode scalars. But there's no reason for that sequence to be "a contiguous chunk of memory filled with 32-bit ints" (aka a UTF-32 string); it can just as well be an iterator which operates on an in-memory UTF-8 string and produces code points.
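A rough sketch of such an iterator in Python. The function name `code_points` is hypothetical, and it assumes the input is already valid UTF-8 (no validation of continuation bytes, overlong forms, or surrogates).

```python
from typing import Iterator

def code_points(data: bytes) -> Iterator[int]:
    """Lazily yield code points from UTF-8 bytes (input assumed valid)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # 0xxxxxxx: 1 byte
            n, cp = 1, lead
        elif lead < 0xE0:      # 110xxxxx: 2 bytes
            n, cp = 2, lead & 0x1F
        elif lead < 0xF0:      # 1110xxxx: 3 bytes
            n, cp = 3, lead & 0x0F
        else:                  # 11110xxx: 4 bytes
            n, cp = 4, lead & 0x07
        # Each continuation byte (10xxxxxx) contributes 6 more bits.
        for cont in data[i + 1 : i + n]:
            cp = (cp << 6) | (cont & 0x3F)
        yield cp
        i += n

utf8 = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print([hex(cp) for cp in code_points(utf8)])
```

A segmentation algorithm can consume this one code point at a time without a UTF-32 buffer ever being materialized.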
> It's how the code point address space is defined.

Not really. Unicode is still fundamentally based on code points, which go from 0 up to (but not including) 2^16 + 2^20, and all of the algorithms and Unicode properties operate on these code points. It's just that Unicode has left open a gap of code points so that the upper 2^20 code points can be encoded in UTF-16 without risk of confusion with other UCS-2 text.

You forgot `- 2^11` for the surrogate pairs. Gee, why isn't Unicode 2^21 code points? To understand the Unicode code point space you must understand UTF-16. The code space is defined by how UTF-16 works. That was my initial point.
If you're going to count the surrogate pairs as not-a-Unicode-codepoint, you should also count the other noncharacters: the last two codepoints on each of the 17 planes and the range U+FDD0-U+FDEF.

The expansion of Unicode beyond the BMP was designed to facilitate an upgrade compatibility path from UCS-2 systems, but it is extremely incorrect to somehow equate Unicode with UTF-16.

FWIW there is an official term for "code points excluding surrogates", it is "Unicode scalar value".
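The arithmetic being argued about, spelled out as a small Python sketch (variable names are illustrative only):

```python
# Code space: the BMP (2^16) plus the 2^20 code points reachable via surrogate pairs.
code_space = 2**16 + 2**20            # U+0000..U+10FFFF

# 1024 high surrogates x 1024 low surrogates give exactly 2^20 pairs.
high = 0xDBFF - 0xD800 + 1
low = 0xDFFF - 0xDC00 + 1
pairs = high * low

surrogates = high + low               # 2^11 reserved code points
scalar_values = code_space - surrogates

# Noncharacters: last two code points of each of the 17 planes, plus U+FDD0..U+FDEF.
noncharacters = 2 * 17 + (0xFDEF - 0xFDD0 + 1)

print(code_space, pairs, surrogates, scalar_values, noncharacters)
```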
OK, I'm lost here. Why is there a 1:1 correspondence between the two?
UTF-8 is superior simply because you can trivially choose to parse it as ASCII and ignore all the weird foreign bytes.
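A small Python sketch of why that works: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so an ASCII delimiter byte can never appear inside one. The sample strings are only illustrative.

```python
fields = ["na\u00efve", "caf\u00e9", "\u65e5\u672c"]
line = ",".join(fields).encode("utf-8")

# Split on the ASCII comma byte without decoding anything first.
parts = line.split(b",")
decoded = [p.decode("utf-8") for p in parts]
print(decoded)

# The same trick is unsafe on UTF-16 bytes: U+022C encodes as 2C 02 in
# UTF-16LE, so the comma byte 0x2C shows up inside a non-comma character.
comma_inside = b"," in "\u022c".encode("utf-16-le")
print(comma_inside)
```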