by millstone
4551 days ago
Bugs in UTF-8 handling of multibyte sequences need not be obvious. Google "CAPEC-80." UTF-16 has an advantage in that there are fewer failure modes, and fewer ways for a string to be invalid.

Edit: As for surrogate pairs, this is an issue, but I think it's overstated. A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.
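To make the two splitting failures concrete, here is a rough Python sketch. The helper `naive_truncate_utf16` and the sample strings are made up for illustration; the idea is just that truncating by code unit can cut a surrogate pair, while truncating a decomposed sequence yields perfectly valid UTF-8 that has silently lost its accent.

```python
import unicodedata

def naive_truncate_utf16(s, n):
    """Hypothetical helper: keep the first n UTF-16 code units of s,
    paying no attention to surrogate-pair boundaries."""
    return s.encode("utf-16-le")[: 2 * n]

# 'a' followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
astral = "a\U0001F600"
cut16 = naive_truncate_utf16(astral, 2)  # 'a' plus a lone high surrogate

try:
    cut16.decode("utf-16-le")
    split_pair_rejected = False
except UnicodeDecodeError:
    # A strict decoder notices the dangling high surrogate.
    split_pair_rejected = True

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (the NFD form of U+00E9).
decomposed = unicodedata.normalize("NFD", "\u00e9")
utf8 = decomposed.encode("utf-8")  # three bytes: 'e' plus a two-byte combining mark
cut8 = utf8[:1]                    # cuts on a code point boundary

# The result is still valid UTF-8, so no decoder complains,
# but the text has changed from "é" to "e".
cut8_text = cut8.decode("utf-8")
```

Note that the second case is the sneakier one: nothing errors out, so only a check at the grapheme level would catch it.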
|
The point is that using UTF-8 makes these issues more obvious. Most programmers these days think to test with non-ASCII characters. Fewer think to test with astral characters.
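A quick way to see this, sketched in Python with made-up sample characters: count how many code units each encoding needs per character. In UTF-8 any non-ASCII character already exercises the multi-unit path, whereas in UTF-16 only an astral character does.

```python
def widths(ch):
    """Return (UTF-8 bytes, UTF-16 code units) needed to encode ch."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2

# Illustrative samples: one ASCII, one non-ASCII BMP, one astral character.
samples = {
    "ascii": "a",
    "bmp": "\u00e9",         # é
    "astral": "\U0001F600",  # an emoji outside the BMP
}

table = {name: widths(ch) for name, ch in samples.items()}

for name, (u8, u16) in table.items():
    print(f"{name}: {u8} UTF-8 byte(s), {u16} UTF-16 code unit(s)")
```

So a test suite that only goes as far as "é" covers the multibyte case for UTF-8 but never touches surrogate handling in UTF-16.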