
by layer8 39 days ago

> UTF-16 was unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough).

ISO 10646 (“Universal Coded Character Set”) planned for 31-bit code points from the start (128 groups of 256 planes of 256 rows of 256 cells, with UCS-4 as a four-byte encoding), around 1989. Unicode, on the other hand, was a parallel effort initiated by Xerox and Apple a few years earlier, with more pragmatic aims, defining a 16-bit character set (but no encoding) that would allow round-tripping of existing character sets. For Unicode 1.1, it was decided to align it with ISO 10646 and make it coincide with the latter’s first plane (the BMP) and UCS-2. In Unicode 2.0, surrogate pairs and the UTF-16 encoding were introduced to allow future expansion to additional planes, in a way that would be compatible with existing implementations. Only with Unicode 3.1 in 2001, five years after Unicode 2.0 and ten years after Unicode 1.0, were actual characters assigned beyond the BMP.
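To make the surrogate mechanism concrete, here is a minimal Python sketch of how a code point beyond the BMP maps to a pair of 16-bit code units and back. The function names are made up for illustration, and the constants are the standard UTF-16 ones (0xD800 and 0xDC00 as the bases of the high and low surrogate ranges, 10 payload bits each). It also checks the arithmetic of the original ISO 10646 layout.

```python
# Illustrative sketch only; function names are hypothetical, not from any library.

def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    v = cp - 0x10000            # 20-bit value
    high = 0xD800 + (v >> 10)   # top 10 bits
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Original ISO 10646 code space: 128 groups x 256 planes x 256 rows x 256 cells.
UCS4_CODE_SPACE = 128 * 256 * 256 * 256  # equals 2**31

# Code space reachable via UTF-16: the BMP plus 16 supplementary planes.
UTF16_CODE_SPACE = 0x10000 + (1 << 20)   # 0x110000 code points
```

For example, `to_surrogate_pair(0x1F600)` gives `(0xD83D, 0xDE00)`, which matches what Python's own `"\U0001F600".encode("utf-16-be")` produces. An implementation that only understands UCS-2 sees two code units from a reserved range rather than misreading them as ordinary characters, which is the compatibility property the comment refers to.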

History is complicated; aims, incentives, and constraints change over time.