| HN Mirror


by millstone 4551 days ago

Bugs in UTF-8 handling of multibyte sequences need not be obvious. Google "CAPEC-80."

UTF-16 has an advantage in that there are fewer failure modes and fewer ways for a string to be invalid.
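To make the "ways to be invalid" point concrete, here is a rough sketch using Python's strict decoders as the validity check. The helper name `is_valid` and the sample byte strings are illustrative assumptions, not anything from CAPEC-80 itself.

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly under a strict decoder."""
    try:
        data.decode(encoding)  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# Several distinct ways for UTF-8 to be malformed:
UTF8_BAD = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "stray continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "encoded surrogate (U+D800)": b"\xed\xa0\x80",
}

# UTF-16 (little-endian here) mostly fails on unpaired surrogates:
UTF16_BAD = {
    "lone high surrogate before 'A'": b"\x00\xd8\x41\x00",
    "lone low surrogate": b"\x00\xdc",
}

for label, raw in UTF8_BAD.items():
    print("utf-8    ", label, is_valid(raw, "utf-8"))
for label, raw in UTF16_BAD.items():
    print("utf-16-le", label, is_valid(raw, "utf-16-le"))
```

A decoder that skips any one of the UTF-8 checks (overlong forms in particular) can let a forbidden character slip past validation that ran on the raw bytes.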

edit: As for surrogate pairs, this is an issue, but I think it's overstated. A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.
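A small sketch of the two kinds of split, in Python. The helper `utf16_units` and the sample strings are made up for illustration; the point is only that naive truncation goes wrong in both cases.

```python
import unicodedata


def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of `s` as integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


# Case 1: an astral character (U+1F600) takes two UTF-16 code units.
astral = "a\U0001F600"
units = utf16_units(astral)          # [0x61, 0xD83D, 0xDE00]
cut = units[:2]                      # naive "first two units"
print(is_high_surrogate(cut[-1]))    # True: the pair was split

# Case 2: a decomposed "e" + COMBINING ACUTE ACCENT (U+0301).
decomposed = "e\u0301"
raw = decomposed.encode("utf-8")     # b'e\xcc\x81'
head = raw[:1].decode("utf-8")       # cuts on a valid UTF-8 boundary
print(head)                          # plain 'e': valid text, accent lost
print(unicodedata.combining("\u0301") != 0)  # True: U+0301 is a combining mark
```

Note the difference in how the two failures look: the split surrogate pair is invalid UTF-16, while the split decomposed sequence is still valid UTF-8 and just silently means something else.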

1 comment

lmm 4550 days ago

> A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.

The point is that using UTF-8 makes these issues more obvious. Most programmers these days think to test with non-ASCII characters. Fewer think to test with astral characters (code points above U+FFFF, which need a surrogate pair in UTF-16).
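A quick way to see this, sketched in Python. The function `survives_truncation` is a hypothetical stand-in for any code that cuts a string at a fixed number of code units (bytes for UTF-8, 16-bit units for UTF-16).

```python
def survives_truncation(text: str, n_units: int, encoding: str) -> bool:
    """Cut the encoded form of `text` after `n_units` code units and
    report whether the result is still decodable."""
    unit_size = {"utf-8": 1, "utf-16-le": 2}[encoding]
    data = text.encode(encoding)[: n_units * unit_size]
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# An ordinary accented test string already trips the UTF-8 version...
print(survives_truncation("n\u00e9", 2, "utf-8"))             # False
# ...but sails through the UTF-16 version,
print(survives_truncation("n\u00e9", 2, "utf-16-le"))         # True
# which only breaks once an astral character shows up.
print(survives_truncation("n\U0001F600", 2, "utf-16-le"))     # False
```

So the same bug exists in both versions, but a test suite that only contains "é" catches it in the UTF-8 code and misses it in the UTF-16 code.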
