by oofabz 4544 days ago
I know that both encodings are variable-length. That is the issue I am trying to address.

My point is that in UTF-16 it's too easy to ignore surrogate pairs. Lots of UTF-16 software fails to handle variable-length characters because they are so rare. But in UTF-8 you can't ignore multi-byte characters without obvious bugs. These bugs are noticed and fixed more quickly than UTF-16 surrogate pair bugs. This makes UTF-8 more reliable.
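To make that concrete, here is a rough sketch (the helper names are mine, not from any library) that counts code units for one sample character from each range. Code that assumes "one unit = one character" breaks on any non-ASCII text in UTF-8, but in UTF-16 it only breaks once an astral character shows up.

```python
def utf16_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids prepending a BOM.
    return len(s.encode("utf-16-le")) // 2


def utf8_units(s: str) -> int:
    # Each UTF-8 code unit is 1 byte.
    return len(s.encode("utf-8"))


# Illustrative samples: one ASCII, one BMP, one astral code point.
SAMPLES = {
    "ascii": "a",            # U+0061
    "bmp": "\u00e9",         # U+00E9, e with acute accent
    "astral": "\U0001F600",  # U+1F600, outside the BMP
}

for name, ch in SAMPLES.items():
    print(name, "utf16 units:", utf16_units(ch), "utf8 units:", utf8_units(ch))
```

Only the astral sample needs more than one UTF-16 code unit, while the UTF-8 count already changes at the BMP sample.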

I am not sure why you think I am advocating UTF-16. I said almost nothing good about it.


Bugs in UTF-8 handling of multibyte sequences need not be obvious. Google "CAPEC-80."
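A sketch of the kind of bug I mean, assuming the relevant case is overlong encodings (as I understand it, that is what CAPEC-80 covers). The lenient decoder below is hypothetical and deliberately wrong: it accepts the two-byte form C0 AF, which decodes to "/" even though the raw bytes contain no 0x2F for a byte-level filter to catch.

```python
def naive_decode_2byte(data: bytes) -> str:
    """Hypothetical lenient decoder: ASCII and 2-byte sequences only,
    with no check for overlong forms (this omission is the bug)."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            out.append(chr(b0))
            i += 1
        elif 0xC0 <= b0 <= 0xDF and i + 1 < len(data):
            # Combine 5 payload bits from the lead byte with 6 from the trail byte.
            out.append(chr(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


payload = b"..\xc0\xaf.."  # overlong encoding of "/" between the dots

# A filter that inspects raw bytes sees no slash...
filter_passes = b"/" not in payload

# ...but the lenient decoder produces one.
decoded = naive_decode_2byte(payload)

# Python's built-in strict decoder rejects the same bytes.
try:
    payload.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

print(filter_passes, repr(decoded), strict_rejects)
```

Nothing here fails visibly on ordinary text, which is why this class of bug is not obvious.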

UTF-16 has an advantage in that there are fewer failure modes and fewer ways for a string to be invalid.

edit: As for surrogate pairs, this is an issue, but I think it's overstated. A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.
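A quick sketch of both kinds of split, with made-up helper names. Cutting UTF-16 data in the middle of a surrogate pair leaves invalid data that a strict decoder rejects. Cutting a decomposed sequence between the base letter and its combining mark leaves perfectly valid text in any encoding, just with the accent silently gone.

```python
import unicodedata


def first_utf16_units(s: str, n_units: int) -> bytes:
    # Naive truncation: keep the first n_units UTF-16 code units (2 bytes each).
    return s.encode("utf-16-le")[: 2 * n_units]


# Case 1: split a surrogate pair (U+1F600 needs two UTF-16 code units).
half_pair = first_utf16_units("\U0001F600", 1)
try:
    half_pair.decode("utf-16-le")
    pair_split_detected = False
except UnicodeDecodeError:
    pair_split_detected = True

# Case 2: split a decomposed sequence ("e" followed by U+0301 combining acute).
decomposed = unicodedata.normalize("NFD", "\u00e9")
head = decomposed[:1]  # drops the combining mark

# The truncated text is still valid UTF-8 and valid UTF-16.
head_roundtrip_utf8 = head.encode("utf-8").decode("utf-8")
head_roundtrip_utf16 = head.encode("utf-16-le").decode("utf-16-le")

print(pair_split_detected, repr(decomposed), repr(head))
```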

> A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.

The point is that using UTF-8 makes these issues more obvious. Most programmers these days think to test with non-ASCII characters. Fewer think to test with astral characters.
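For what it's worth, a minimal sketch of the test inputs I have in mind, run against a hypothetical truncation helper (the function and the input list are illustrative, not from any real codebase). The astral case is the one that usually gets left out.

```python
def truncate_utf16(s: str, max_units: int) -> str:
    """Hypothetical helper: keep at most max_units UTF-16 code units
    without splitting a surrogate pair. Does not handle combining marks."""
    out = []
    used = 0
    for ch in s:
        # Code points above U+FFFF take two UTF-16 code units.
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > max_units:
            break
        out.append(ch)
        used += units
    return "".join(out)


# One input per class: ASCII only, non-ASCII in the BMP, and astral.
TEST_INPUTS = ["abc", "h\u00e9llo", "a\U0001F600b"]

for text in TEST_INPUTS:
    for limit in range(len(text) + 2):
        result = truncate_utf16(text, limit)
        # The result must round-trip through strict UTF-16, so no lone surrogates.
        assert result.encode("utf-16-le").decode("utf-16-le") == result
        assert len(result.encode("utf-16-le")) // 2 <= limit
```

With only the first two inputs, a version that slices by code unit would pass every check.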