

by layer8 274 days ago

The history is more complicated than that. Originally, ISO/IEC 10646 and Unicode were two separate efforts: only ISO envisioned a roughly 31-bit code space, while Unicode was strictly 16 bits [0][1]. UTF-8 as in TFA was clearly developed to cover the 31-bit ISO character collection, with the encoding going up to 6 bytes per character. It was only later restricted (in RFC 3629, 2003) to the 4 bytes sufficient to cover the roughly 21-bit code space, up to U+10FFFF, that Unicode 2.0 (1996) had established with surrogates. The initial UTF-8 development is therefore somewhat beyond the scope of what Unicode 1.x was about at the time.
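To make the 6-byte versus 4-byte point concrete, here is a minimal sketch of how many bytes a code point needs under each variant. The function name `utf8_length` and its `original_scheme` flag are made up for illustration; the per-length payload sizes (7, 11, 16, 21, 26, 31 bits) follow from the UTF-8 bit layout.

```python
# Largest code point representable with 1..6 bytes in the original UTF-8
# design (payload bits: 7, 11, 16, 21, 26, 31).
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length(code_point: int, original_scheme: bool = False) -> int:
    """Return the number of bytes needed to encode code_point.

    original_scheme=True allows the full 31-bit ISO range (up to 6 bytes);
    otherwise the range is capped at U+10FFFF (at most 4 bytes).
    Surrogate code points are not rejected here; this only counts bytes.
    """
    max_cp = 0x7FFFFFFF if original_scheme else 0x10FFFF
    if not 0 <= code_point <= max_cp:
        raise ValueError(f"code point out of range: {code_point:#x}")
    for n_bytes, limit in enumerate(_LIMITS, start=1):
        if code_point <= limit:
            return n_bytes
    raise AssertionError("unreachable: range already checked")


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
    print(f"0x7FFFFFFF (31-bit): {utf8_length(0x7FFFFFFF, original_scheme=True)} byte(s)")
```

Note that 4 bytes carry 21 payload bits, which is why the cap at U+10FFFF fits exactly within the 4-byte form.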

Furthermore, the development of Windows NT had already begun in 1989 (then planned as OS/2 3.0) and proceeded in parallel with the finalization of Unicode 1.0, and with its eventual adoption by ISO that led to Unicode 1.1 and ISO/IEC 10646-1:1993. It was natural to adopt that standardization effort.

Once established, the 16-bit encoding used by Windows NT was ingrained in kernel and userspace APIs, notably the BSTR string type used by Visual Basic and COM, and importantly in NTFS. Adopting UTF-8 for Windows XP would have provided little benefit at that point, while causing a lot of complications. For backwards compatibility, something like WTF-8 would effectively have been required, because existing 16-bit strings and file names can contain unpaired surrogates that strict UTF-8 cannot represent, and there would have been an additional performance penalty for converting back and forth from the existing WCHAR/BSTR APIs and serializations. It wasn't remotely a viable opportunity for such a far-reaching change.
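As a rough illustration of why a WTF-8-like encoding would have been needed, here is a sketch using Python's `surrogatepass` error handler as a stand-in (it is not a WTF-8 implementation as such, and the helper names `wide_to_bytes` and `bytes_to_wide` are made up). It takes 16-bit code units containing an unpaired surrogate, as a legacy WCHAR string could, and shows that strict UTF-8 refuses them while the lenient form round-trips.

```python
import struct


def wide_to_bytes(utf16le: bytes) -> bytes:
    """Convert UTF-16LE code units to a UTF-8-like byte string,
    letting unpaired surrogates through as 3-byte sequences."""
    text = utf16le.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def bytes_to_wide(data: bytes) -> bytes:
    """Inverse of wide_to_bytes: recover the original 16-bit code units."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


if __name__ == "__main__":
    # 'a' followed by a lone high surrogate (0xD800), which is not valid
    # UTF-16 text but is a perfectly storable sequence of 16-bit units.
    wide = struct.pack("<2H", 0x0061, 0xD800)

    try:
        wide.decode("utf-16-le", errors="surrogatepass").encode("utf-8")
    except UnicodeEncodeError as exc:
        print("strict UTF-8 rejects it:", exc.reason)

    lenient = wide_to_bytes(wide)
    print("lenient bytes:", lenient.hex(" "))
    print("round-trips:", bytes_to_wide(lenient) == wide)
```

Every crossing between a byte-oriented API and the existing 16-bit APIs would pay for a conversion like this, which is the performance penalty mentioned above.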

Lastly, my recollection is that UTF-8 only became really widespread on the web some time after the release of Windows XP (2001), maybe roughly around Vista.

[0] https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#...

[1] "Internationalization and character set standards", September 1993, https://dl.acm.org/doi/pdf/10.1145/174683.174687