Hacker News
by speeder 2867 days ago
UTF-16 is always 16 bits; UTF-8 is variable and can go from 8 to 32 bits as needed...

This does make coding for UTF-8 harder, but when it works it is really wonderful stuff.

3 comments

UTF-16 is not always 16b/codepoint. UCS-2, its ancestor and more or less what Windows actually uses when it talks about "Unicode", was 16b/codepoint. It turned out that 64k codepoints wasn't enough for everybody, so they increased the range and added "surrogate pairs" to UCS-2 to make UTF-16. A codepoint in UTF-16 therefore takes either 16 or 32 bits. Many systems that adopted UCS-2 claim UTF-16 compatibility but in fact allow unpaired surrogates (an error in UTF-16). The necessity of encoding such invalid UTF-16 strings in UTF-8 under certain circumstances led to the pseudo-standard known as WTF-8[0].
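
For anyone who wants to see the surrogate-pair mechanism spelled out, here is a minimal sketch in Python. The function name `utf16_code_units` is made up for illustration; the arithmetic is the standard UTF-16 rule (subtract 0x10000, split the remaining 20 bits into two 10-bit halves).

```python
def utf16_code_units(cp: int) -> list[int]:
    """Encode one Unicode codepoint as a list of 16-bit UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("codepoint outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not encodable codepoints")
    if cp < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit, same as UCS-2.
        return [cp]
    # Supplementary planes: 20 bits spread over a high and a low surrogate.
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # top 10 bits
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600):
        units = utf16_code_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

So U+20AC stays a single 16-bit unit, while U+1F600 needs two, which is exactly where "always 16 bit" breaks down.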

UTF-8 was designed (as legend has it, on the back of a napkin during a dinner break) after the increase in range and doesn't suffer from the same problem. Additionally, its straightforward "continuation" mechanism isn't any more difficult to deal with than surrogate pairs, and it doesn't have any endianness issues like UTF-16/UCS-2.

[0] https://simonsapin.github.io/wtf-8/
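
To illustrate the "continuation" mechanism mentioned above, here is a rough sketch of a UTF-8 encoder in Python. The name `utf8_bytes` is hypothetical; in real code you would just call `str.encode("utf-8")`. The lead byte announces the sequence length, and every following byte starts with the bits `10`.

```python
def utf8_bytes(cp: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 (1 to 4 bytes)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if cp < 0x80:
        # 0xxxxxxx: plain ASCII, one byte.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:04X} ->", utf8_bytes(cp).hex(" "))
```

Because the output is a byte sequence with a fixed order, there is no byte-order question to settle, unlike the 16-bit units of UTF-16/UCS-2.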

UCS-2 is always 2 octets. UTF-16 allows for so-called surrogate pairs, which expand the bits available and make it variable size.
Look up UCS-2 that I mentioned in my comment.
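
To make the UCS-2 versus UTF-16 distinction concrete: a system that treats text as plain 16-bit units will happily accept a lone surrogate, while a strict UTF-16 validator must reject it. A minimal sketch of such a check, with the hypothetical name `find_unpaired_surrogate`, assuming the input is already a sequence of 16-bit integers:

```python
def find_unpaired_surrogate(units: list[int]) -> int:
    """Return the index of the first unpaired surrogate, or -1 if none.

    `units` is a sequence of 16-bit code units. A result of -1 means the
    sequence is well-formed UTF-16.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate before it.
            return i
        i += 1
    return -1


if __name__ == "__main__":
    print(find_unpaired_surrogate([0x41, 0xD83D, 0xDE00]))  # well-formed pair
    print(find_unpaired_surrogate([0xD83D, 0x41]))          # high surrogate alone
```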