by loulouxiv 1617 days ago
UTF-16 can represent the full range of Unicode codepoints by using surrogate pairs
3 comments

Technical amendment: UTF-16 can represent the full range of Unicode scalar values with surrogate pairs. Code points include the surrogates U+D800–U+DFFF; scalar values don’t. Like all other Unicode encodings, UTF-16 cannot represent surrogates.
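
To make the distinction concrete, here is a small Python sketch (Python only because its codecs are strict about this; the helper names `utf16_code_units` and `can_encode_utf16` are made up for illustration). It shows a scalar value above U+FFFF turning into a surrogate pair, and a lone surrogate code point being rejected by the encoder.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def can_encode_utf16(text: str) -> bool:
    """Report whether Python's strict UTF-16 encoder accepts text."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


# U+1F600 is a scalar value above U+FFFF, so it is encoded as two code units.
pair = utf16_code_units("\U0001F600")
print([hex(u) for u in pair])

# U+D800 is a code point but not a scalar value, so the encoder refuses it.
print(can_encode_utf16("\ud800"))
```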

That’s where the real problem lies: almost nothing that uses UTF-16 actually uses UTF-16, but rather potentially ill-formed UTF-16.
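
A rough sketch of what "potentially ill-formed" means in practice: a sequence of 16-bit code units is well-formed UTF-16 only if every high surrogate is immediately followed by a low surrogate and no low surrogate stands alone. The function name `is_well_formed_utf16` is hypothetical, and this is an illustration rather than any particular library's validator.

```python
def is_well_formed_utf16(units: list[int]) -> bool:
    """Check that every surrogate in a code unit sequence is correctly paired."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


print(is_well_formed_utf16([0xD83D, 0xDE00]))  # a proper pair
print(is_well_formed_utf16([0xD83D]))          # a lone high surrogate
```

A 16-bit string type that allows arbitrary code units can hold the second sequence, which is why such strings are only "potentially" UTF-16.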

You're right. Replace UTF-16 with UCS-2 and the comment sounds at least slightly more correct.
Sort of. Applications using UTF-16 have to be aware of surrogate pairs at the application level. Many are not.
This isn't a consequence of using UTF-16 as such - Java, .NET etc. could totally have an API around UTF-16 strings that handles surrogate pairs. The problem, rather, is that those languages introduced a 16-bit type that they called "character", even though it isn't a Unicode code point. And then used that type throughout all string APIs, including strings themselves (indexing etc.).
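
To illustrate the indexing problem without depending on Java or .NET, here is a Python sketch that simulates a string of 16-bit "characters" by unpacking the UTF-16 code units into a tuple. The variable names are made up for the example; the point is only that length and indexing operate on code units, not on code points.

```python
import struct

text = "a\U0001F600b"  # three code points: 'a', U+1F600, 'b'

# Simulate a 16-bit-"char" string: one tuple element per UTF-16 code unit.
data = text.encode("utf-16-le")
units = struct.unpack(f"<{len(data) // 2}H", data)

print(len(text))      # number of code points
print(len(units))     # number of 16-bit code units
print(hex(units[1]))  # the "character" at index 1 is half of a surrogate pair
```

With a 16-bit character type, index 1 hands back a high surrogate on its own, and the emoji counts as two "characters".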

In .NET land you're now supposed to use https://docs.microsoft.com/en-us/dotnet/api/system.text.rune instead. It transparently handles surrogate pairs, so the app needn't be aware of anything - and yet the internal encoding is still UTF-16.
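
The idea of iterating by scalar value over UTF-16 storage can be sketched in a few lines of Python. This is not the .NET API, just an illustration of the technique; `iter_scalars` is a hypothetical name, and substituting U+FFFD for unpaired surrogates is an assumption made for this sketch.

```python
from typing import Iterator, Sequence


def iter_scalars(units: Sequence[int]) -> Iterator[int]:
    """Yield Unicode scalar values from a sequence of UTF-16 code units."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine a high/low surrogate pair into one scalar value.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: emit U+FFFD (assumption for this sketch).
            yield 0xFFFD
            i += 1
        else:
            yield u
            i += 1


scalars = list(iter_scalars([0x61, 0xD83D, 0xDE00, 0x62]))
print([hex(s) for s in scalars])
```

The storage stays 16-bit code units throughout; only the iteration layer knows about pairs.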