by oofabz
4544 days ago
I know that both encodings are variable-length. That is the issue I am trying to address. My point is that in UTF-16 it's too easy to ignore surrogate pairs. Lots of UTF-16 software fails to handle variable-length characters because they are so rare. But in UTF-8 you can't ignore multi-byte characters without obvious bugs. These bugs are noticed and fixed more quickly than UTF-16 surrogate pair bugs. This makes UTF-8 more reliable. I am not sure why you think I am advocating UTF-16. I said almost nothing good about it.
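To make the "so rare" point concrete, here is a minimal Python sketch that counts code units per character in each encoding. The helper names `utf16_units` and `utf8_bytes` and the sample characters are made up for illustration; only the standard `str.encode` call is relied on.

```python
# Illustrative helpers (hypothetical names, not from any library).

def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s: str) -> int:
    """Number of bytes needed to encode s as UTF-8."""
    return len(s.encode("utf-8"))


# Sample characters chosen for illustration only.
samples = {
    "LATIN SMALL LETTER A": "a",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",
}

for name, ch in samples.items():
    print(
        f"U+{ord(ch):04X} {name}: "
        f"UTF-16 units={utf16_units(ch)}, UTF-8 bytes={utf8_bytes(ch)}"
    )
```

Of these four, only the character outside the Basic Multilingual Plane needs more than one UTF-16 code unit, while three of the four need more than one UTF-8 byte. That is why code that assumes one unit per character tends to break early on UTF-8 input but can go unnoticed for a long time on UTF-16 input.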
UTF-16 has an advantage in that there are fewer failure modes, and fewer ways for a string to be invalid.
edit: As for surrogate pairs, this is an issue, but I think it's overstated. A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.
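A rough Python sketch of the two kinds of split, under the assumption that "naïve" means truncating at a fixed code-unit offset. The helper `decodes_cleanly` is a hypothetical name; the strings are examples picked for illustration.

```python
import unicodedata


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Case 1: split a UTF-16 surrogate pair by keeping only the first code unit.
emoji = "\U0001F600"                      # encoded as two UTF-16 code units
half_pair = emoji.encode("utf-16-le")[:2]  # first code unit only (2 bytes)
print("split surrogate pair is valid UTF-16:",
      decodes_cleanly(half_pair, "utf-16-le"))

# Case 2: split a decomposed sequence, "e" followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
first_byte = decomposed.encode("utf-8")[:1]  # keeps "e", drops the accent
print("split decomposed sequence is valid UTF-8:",
      decodes_cleanly(first_byte, "utf-8"))
print("text after the split:", first_byte.decode("utf-8"))

# For comparison, the unsplit sequence composes to a single code point.
print("NFC of the original:",
      unicodedata.normalize("NFC", decomposed) == "\u00e9")
```

The first split yields an invalid sequence that a strict decoder rejects. The second yields perfectly valid UTF-8 that has silently lost its accent, and the same thing would happen to that sequence in UTF-16. Catching the second kind requires grapheme-cluster-aware code no matter which encoding is in use.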