Hacker News new | ask | show | jobs
by account42 285 days ago
We can argue about "biggest" all day long but UTF-16 is a huge design failure because it made a huge chunk of the lower Unicode space unusable, thereby making better encodings like UTF-8 that could easily represent those code points less efficient. This layer-violating hack should have made it clear that UTF-16 was a bad idea from the start.

Then there is also the issue that technically there is no such thing as UTF-16, instead you need to distinguish UTF-16LE and UTF-16BE. Even though approximately no one uses the latter we still can't ignore it and have to prepend documents and strings with byte order markers (another wasted pair of code points for the sake of an encoding issue) which mean you can't even trivially concatenate them anymore.

Meanwhile UTF-8 is backwards compatible with ASCII, byte order independent, has tons of useful properties and didn't require any Unicode code point assignments to achieve that.

The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly when it became clear that two bytes wasn't going to be enough. It's a dirty hack to cover up a mistake that should have never existed.

1 comments

> The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly

That's a strange way to characterize years of backwards compatibility to deal with

https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...

There are many OS interfaces that were deprecated after five years or even longer. It's been multiple times those five years since then and we'll likely have to deal with UTF-16 for much longer still. Having to provide backwards compatibility for UTF-16 interface doesn't mean they had to keep these as the defaults or provide new UTF-16 interfaces. In particular WIN32 already has 8-bit char interfaces that Microsoft could have easily added UTF-8 support to right then and re-blessed as the default. The decision not to do that was not a technical one but a political one.
This isn't "deprecate a few functions" -- it's basically an effort on par with migrating to Unicode in the first place.

I disagree you could just "easily" shove it into the "A" version of functions. Functions that accept UTF-8 could accept ASCII, but you can't just change the semantics of existing functions that emit text because it would blow up backwards compatibility. In a sense it is covariant but not contravariant.

And now, after you've gone through all of this effort: what was the actual payoff? And at what cost if maintaining compatibility with the other representations?