|
This is why UTF-8 is great. If your code works for any non-ASCII character, it will work for them all. Surrogate pairs are rare enough that the code handling them is poorly tested. With UTF-8, problems with multi-byte characters are obvious enough to get fixed. UTF-16 is not a very good encoding; it only exists for legacy reasons. It has the same major drawback as UTF-8 (variable-length encoding) but none of the benefits (ASCII compatibility, size efficiency). |
4-byte characters in UTF-8 are just as rare as surrogate pairs in UTF-16, because both are used to represent non-BMP characters. As a result, there is software that handles 3-byte characters (i.e., a huge percentage of what you'll ever see) but doesn't handle 4-byte characters.
MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....
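To make the point concrete, here is a rough Python sketch. The helper names `encoded_sizes` and `fits_in_three_byte_utf8` are made up for illustration and are not part of MySQL or any library. The first shows that the code points needing 4 bytes in UTF-8 are the same ones needing a surrogate pair in UTF-16; the second is the kind of check you would need before handing text to a store that, like MySQL's old 3-byte `utf8` charset, only accepts BMP characters.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" prepends.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


def fits_in_three_byte_utf8(text: str) -> bool:
    """True if every code point is in the BMP (at most 3 bytes in UTF-8).

    Hypothetical helper: approximates the limit of a 3-byte-max UTF-8 store.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


# ASCII, Latin-1 supplement, a 3-byte BMP character, and a non-BMP character.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8_bytes, utf16_units = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 units")
```

Only the last character (U+1F600) takes 4 UTF-8 bytes and 2 UTF-16 code units, and it is also the only one for which `fits_in_three_byte_utf8` returns `False`. Code that is only ever tested with the first three will look fine in both encodings.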