by wvenable 99 days ago
I don't think it was clear at the time that UTF-8 would take off. UCS-2 and then UTF-16 were well established by 2000 in both Microsoft technologies and elsewhere (like Java). Linux, despite the existence of UTF-8, would still take years to get acceptable internationalization support. Developing good and secure internationalization is a hard problem -- it took a long time for everyone.

It's now 2026; everything always looks different in hindsight.

3 comments

I don’t remember it quite that way. Localization was a giant question, sure. Are we using C or UTF-8 for the default locale? That had lots of screaming matches. But in the network service world, I don’t remember ever hearing more than a token resistance against choosing UTF-8 as the successor to ASCII. It was a huge win, especially since ASCII text is already valid UTF-8 text. Make your browser default to parsing docs with that encoding and you can still parse all existing ASCII docs with zero changes! That was a huge, enormous selling point.
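To make the backward-compatibility point concrete, here is a minimal sketch in Python. The sample string is just an illustrative assumption; any pure-ASCII text should behave the same way.

```python
# Illustrative sample text (an assumption); any pure-ASCII string works.
text = "GET /index.html HTTP/1.0"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# ASCII and UTF-8 give byte-for-byte identical output for ASCII text.
same_bytes = ascii_bytes == utf8_bytes

# An existing ASCII document decodes unchanged under a UTF-8 parser.
round_trip = ascii_bytes.decode("utf-8") == text

# UTF-16 spends two bytes on each of these characters, so its byte
# layout differs from the ASCII original.
utf16_doubles = len(utf16_bytes) == 2 * len(ascii_bytes)

print(same_bytes, round_trip, utf16_doubles)
```

The last check is roughly why a UTF-16 switch meant touching existing data and code, while a UTF-8 switch mostly did not.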

Windows is far from a niche player, to be sure. Yet it seems like literally every other OS but them was going with one encoding for everything, while they went in a totally different direction that got complaints even then. I truly believe they thought they’d win that battle and eventually everyone else would move to UTF-16 to join them. Meanwhile, every other OS vendor was like, nah, no way we’re rewriting everything from scratch to work with a non-backward-compatible encoding.

Microsoft did the hard work of supporting Unicode when UTF-8 didn't exist (and mostly when UTF-16 didn't exist).

Any system that continued with only ASCII well into the 2000s could mostly just jump into UTF-8 without issue. Doing nothing for non-English users for almost two decades turned out to be a solid plan long term. Microsoft certainly didn't have that option.

Blame Java - their use of UTF-16 is the sole reason that Microsoft chose it.

Sun sued Microsoft in 1997 for making nonportable extensions to Java (a license violation). Microsoft lost, and created C# in 2000.

At the time, “Starting Java” was the most feared message on the internet. People really thought that in-browser Java would take over the world (yes Java, not JavaScript).

Sun chose UTF-16 in 1995 believing that Unicode would never need more than 64k characters. In 1996 that changed. UTF-16 got variable-length encoding and became a white elephant.
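For anyone who hasn't run into the variable-length part, a rough sketch in Python. The helper name `utf16_units` and the two sample characters are illustrative choices, not anything from the Java or Windows APIs.

```python
# Sample characters (illustrative): one inside the original 16-bit
# range, one outside it.
bmp_char = "A"               # U+0041
astral_char = "\U0001F600"   # U+1F600, beyond U+FFFF

def utf16_units(s):
    """Count the 16-bit code units in the UTF-16 encoding of s."""
    return len(s.encode("utf-16-le")) // 2

units_bmp = utf16_units(bmp_char)        # one 16-bit unit
units_astral = utf16_units(astral_char)  # a surrogate pair: two units

# The same character in UTF-8, for comparison.
utf8_len_astral = len(astral_char.encode("utf-8"))

print(units_bmp, units_astral, utf8_len_astral)
```

So code that assumes one 16-bit unit per character breaks on anything past U+FFFF, which is the fixed-width advantage disappearing.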

So Microsoft chose UTF-16 knowing full well that it had no advantages. But at least they can say code pages were far worse :)

At the time it was introduced it was understandable, and Microsoft also needed some time to implement it before that of course. But by about 2000 it was clear that UTF-8 was going to win, and Microsoft should have just properly implemented it in NT instead of dithering about for almost the next 20 years. Linux had quite good support for it by then.