by notJim
4948 days ago
This comment is somewhat misleading. The issue at hand is orthogonal to any of the benefits of UTF-8 over UTF-16 (which are real; UTF-8 is great, and you should use it). 4-byte characters in UTF-8 are just as rare as surrogate pairs in UTF-16, because both are used to represent non-BMP characters. As a result, there is software that handles characters of up to 3 bytes (i.e., a huge percentage of what you'll ever see), but doesn't handle 4-byte characters. MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....
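To make the correspondence concrete, here is a rough Python sketch (the helper names `utf8_len` and `utf16_units` are made up for illustration) showing that a code point needs 4 bytes in UTF-8 exactly when it needs a surrogate pair in UTF-16, i.e. when it lies outside the BMP:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode one code point in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode one code point in UTF-16."""
    # "utf-16-le" avoids the byte-order mark that plain "utf-16" prepends.
    return len(ch.encode("utf-16-le")) // 2


# Illustrative samples: ASCII, Latin-1 range, BMP symbol, non-BMP emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    outside_bmp = ord(ch) > 0xFFFF
    print(
        f"U+{ord(ch):04X}: {utf8_len(ch)} UTF-8 byte(s), "
        f"{utf16_units(ch)} UTF-16 unit(s), outside BMP: {outside_bmp}"
    )
```

So code that only copes with UTF-8 sequences of up to 3 bytes breaks on the same characters as code that ignores surrogate pairs.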
So whereas it's rather common for programs to mis-handle multiple-unit UTF-16 characters, it seems much less likely that programs will mis-handle 4+ unit UTF-8 characters.