HN Mirror


by notJim 4995 days ago

This comment is somewhat misleading. The issue at hand is orthogonal to any of the benefits of UTF-8 over UTF-16 (which are real: UTF-8 is great, and you should use it).

4-byte characters in UTF-8 are just as rare as surrogate pairs in UTF-16, because both are used to represent non-BMP characters. As a result, there is software that handles characters of up to 3 bytes (i.e., a huge percentage of what you'll ever see) but doesn't handle 4-byte characters.
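To make the correspondence concrete, here is a small sketch in Python that counts how many UTF-8 bytes and UTF-16 code units a single character needs. The helper name `unit_counts` is made up for illustration; it only relies on the standard `str.encode`.

```python
def unit_counts(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character.

    Hypothetical helper for illustration only.
    """
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" emits no byte order mark; each code unit is 2 bytes.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


# U+20AC EURO SIGN is inside the BMP.
bmp_counts = unit_counts("\u20ac")
# U+1F600 GRINNING FACE is outside the BMP.
non_bmp_counts = unit_counts("\U0001F600")
```

The BMP character takes 3 bytes in UTF-8 and a single unit in UTF-16, while the non-BMP character takes 4 bytes in UTF-8 and a surrogate pair (2 units) in UTF-16. So the same set of characters triggers the "rare" path in both encodings.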

MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....

1 comment

snogglethorpe 4994 days ago

The problem is that handling 1 unit is very different from handling 2+ units, in terms of coding patterns, whereas handling 3 is not so different from handling 4+. In the latter case there's probably already a loop to handle multi-unit characters, which will in many cases work without change for longer sequences (and if not, the code probably requires very little change to do so).

So whereas it's rather common for programs to mis-handle multiple-unit UTF-16 characters, it seems much less likely that programs will mis-handle 4+ unit UTF-8 characters.
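As a rough illustration of the loop argument, here is a minimal UTF-8 decoder sketch in Python. It is not taken from any real program; the function name `decode_utf8` is hypothetical, and it assumes well-formed input (no checks for truncated, overlong, or otherwise invalid sequences).

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode well-formed UTF-8 into a list of code points.

    Illustrative sketch only: assumes valid input and does no validation
    beyond rejecting an unrecognised lead byte.
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte determines how many continuation bytes follow.
        if lead < 0x80:
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:
            cp, extra = lead & 0x07, 3  # the only 4-byte-specific line
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02x} at offset {i}")
        # One loop handles 1, 2, or 3 continuation bytes alike.
        for b in data[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += 1 + extra
    return code_points


decoded = decode_utf8("A\u20ac\U0001F600".encode("utf-8"))
```

In this sketch, supporting 4-byte sequences costs one extra branch on the lead byte, and the continuation loop is untouched. Code written around the assumption that one UTF-16 unit equals one character has no such loop to extend.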