
UTF-16 is not always 16b/codepoint. UCS-2, its ancestor and more or less what Windows actually uses when it talks about "Unicode", was 16b/codepoint. It turned out that 64k codepoints weren't enough for everybody, so they increased the range and added "surrogate pairs" to UCS-2 to make UTF-16. A codepoint in UTF-16 therefore takes either 16b or 32b. Many systems that adopted UCS-2 claim UTF-16 compatibility but in fact allow unpaired surrogates (an error in UTF-16). The necessity of encoding such invalid UTF-16 strings in UTF-8 under certain circumstances led to the pseudo-standard known as WTF-8[0].
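To make the 16b-vs-32b split concrete, here is a rough Python sketch of the surrogate-pair arithmetic, plus a check for the unpaired surrogates that UCS-2-era systems tend to let through. The function names (utf16_code_units, has_unpaired_surrogate) are made up for illustration, and the encoder deliberately does no validation.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Encode one code point as a list of 16-bit UTF-16 code units.

    Illustrative only: no validation, so a lone surrogate value passes
    straight through, much like on a UCS-2-minded system.
    """
    if cp < 0x10000:
        # Basic Multilingual Plane: a single 16-bit unit (all UCS-2 could express).
        return [cp]
    # Supplementary planes: split the 20-bit offset into a high/low surrogate pair.
    offset = cp - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


def has_unpaired_surrogate(units: list[int]) -> bool:
    """Return True if the code-unit sequence is not well-formed UTF-16."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False


print([hex(u) for u in utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F4A9)])  # ['0xd83d', '0xdca9']
print(has_unpaired_surrogate([0xD83D, 0xDCA9]))     # False
print(has_unpaired_surrogate([0x41, 0xD83D]))       # True
```

The last case is the kind of string that is legal in a "UCS-2 with surrogates allowed" system but has no valid UTF-8 encoding, which is the gap WTF-8 is meant to fill.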

UTF-8 was designed (as legend has it, on the back of a napkin during a dinner break) with a range far beyond 64k codepoints in mind from the start, so it doesn't suffer from the same problem. Additionally, its straightforward "continuation" mechanism isn't any more difficult to deal with than surrogate pairs, and it doesn't have any endianness issues like UTF-16/UCS-2.
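For comparison, a minimal sketch of the UTF-8 "continuation" scheme in Python: a lead byte announces the length, and each following byte carries 6 payload bits behind a 10xxxxxx marker. The name utf8_encode is just for illustration; real code should simply call str.encode. The last two lines show the endianness issue on the UTF-16 side.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative; use str.encode in real code)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Strict UTF-8 rejects surrogates and out-of-range values.
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])  # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),            # 110xxxxx
                      0x80 | (cp & 0x3F)])         # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),           # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx
                      0x80 | (cp & 0x3F)])         # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),               # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),      # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),       # 10xxxxxx
                  0x80 | (cp & 0x3F)])             # 10xxxxxx


print(utf8_encode(0x20AC).hex())   # e282ac
print(utf8_encode(0x1F4A9).hex())  # f09f92a9

# UTF-8 is a byte sequence, so there is only one serialization.
# UTF-16 has two, depending on byte order:
print("\u20ac".encode("utf-16-le").hex())  # ac20
print("\u20ac".encode("utf-16-be").hex())  # 20ac
```

Note that the strict encoder refuses lone surrogates; relaxing that one check is roughly the idea behind WTF-8, though the actual spec has more rules than this sketch covers.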

[0] https://simonsapin.github.io/wtf-8/