Hacker News
by pixelcort 4948 days ago
The problem with UTF-8 is that lots of tools have 3 byte limits, and characters like Emoji take up 4 bytes in UTF-8.
6 comments

Honest question: isn't that just a broken implementation (and a very obvious brokenness at that)? It seems to me there's a big difference between someone not coding to the standard and the standard making your task impossible.
The same could be said of UTF-16 implementations that don't support surrogate pairs.
But that's not true for UCS-2, which can't represent certain code points.
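To make the UTF-16 versus UCS-2 distinction concrete, here is a minimal sketch of how a code point above the BMP is split into a surrogate pair. The function names `to_utf16_units` and `fits_in_ucs2` are made up for illustration; the arithmetic is the standard UTF-16 encoding rule.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the 16-bit UTF-16 code units for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Inside the BMP: a single 16-bit unit, identical to UCS-2.
        return [code_point]
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


def fits_in_ucs2(code_point: int) -> bool:
    """UCS-2 has no surrogate mechanism, so only the BMP is representable."""
    return code_point < 0x10000


# U+1F600 (an emoji) needs two units; U+20AC (euro sign) needs one.
print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
print(fits_in_ucs2(0x20AC), fits_in_ucs2(0x1F600))
```

A decoder that keeps only one 16-bit unit per character is effectively UCS-2, and for it U+1F600 simply has no representation.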

On the other hand, UTF-8 is ASCII compatible and more efficient for text that's primarily ASCII.

If most implementations are broken, it becomes the standard.
How many tools have 3-byte limits on UTF-8? The only one I can think of right now is MySQL. (The workaround is to specify the utf8mb4 character set. This is MySQL's cryptic internal name for "actually doing UTF-8 correctly.")
MySQL is one of the worst offenders for broken Unicode and collation problems arising therein. Neither it nor JavaScript deserves consideration for problems that need robust Unicode handling.
I actually switched my (low traffic, low performance needs) blog comments database from MySQL to SQLite purely because I could not make MySQL and Unicode get along. All I needed was for it to accept and then regurgitate UTF-8 and it couldn't even handle that. I'm sure it can be done, but none of the incantations I tried made it work, and it was ultimately easier for me to switch databases.
As an ugly last resort, you could store Unicode as UTF-8 in BLOB fields. MySQL is pretty good about storing binary data. (I dread the day that I'll have to do something more advanced with Unicode in MySQL than just storing it.)
I no longer recall whether I tried that and failed, or didn't get that far. Seems like a semi-reasonable approach if you don't need the database to be able to understand the contents of that column. But on the other hand, SQLite is working great for my needs too.
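For anyone wanting to check the "accept and then regurgitate UTF-8" requirement, a rough sketch using Python's built-in sqlite3 module might look like the following. The table and column names are hypothetical, and SQLite stands in here only because it ships with Python; the BLOB column mirrors the last-resort idea above rather than showing anything MySQL-specific.

```python
import sqlite3


def round_trip(comment: str) -> tuple:
    """Store a string as TEXT and as UTF-8 bytes in a BLOB, then read both back."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: one TEXT column, one BLOB column.
        conn.execute("CREATE TABLE comments (body TEXT, raw BLOB)")
        conn.execute(
            "INSERT INTO comments (body, raw) VALUES (?, ?)",
            (comment, comment.encode("utf-8")),
        )
        body, raw = conn.execute("SELECT body, raw FROM comments").fetchone()
        # The BLOB comes back as bytes; decoding is the application's job.
        return body, bytes(raw).decode("utf-8")
    finally:
        conn.close()


text = "caf\u00e9 \U0001F600"  # includes a 4-byte UTF-8 character
print(round_trip(text) == (text, text))
```

The BLOB route trades away collation and text functions on that column, which matches the caveat that it only suits data the database does not need to understand.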
I don't think it's specifically a 3-byte limit; I think it's just that lots of tools decode UTF-8 into UCS-2 internally instead of UTF-16.
That is quite clearly broken, and any tool that does so should be fixed or dumped. This is not new, and Markus Kuhn has made UTF-8 test resources available for years at http://www.cl.cam.ac.uk/~mgk25/unicode.html
I've found this sub-page super useful over the years for testing http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
I believe that MySQL is one such tool. In recent versions you can work around it by asking for the encoding "utf8mb4" instead of "utf8", but I think they have to be quite recent.

So yes, another way in which MySQL is quite clearly broken.

Which tools?

Honest question, as the three byte limit seems rather arbitrary and no more logical than, say, a four byte one.

It is totally arbitrary - there's no reason you can't have degenerate 6-byte encodings, and compliant decoders should cope with them. See Markus Kuhn's excellent UTF-8 decoder torture test page linked elsewhere in this thread.
Actually, a truly compliant UTF-8 decoder should reject all degenerate forms, because they can be used to bypass validation checks, and the spec does not allow such forms. A four-byte UTF-8 sequence is sufficient to represent any Unicode code point (and then some), so no compliant decoder should accept any sequences of 5 or 6 bytes.

Note that Kuhn's torture test page deliberately includes a lot of invalid sequences in order to make sure that they're gracefully handled. Section 4 is dedicated to degenerate encodings like this.
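As a quick illustration of the "reject degenerate forms" point, here is a small sketch that leans on Python's strict UTF-8 codec. The helper name `is_valid_utf8` is made up; the byte strings are the overlong encodings of "/" (U+002F) in 2 to 6 bytes, the same kind of sequences the torture test exercises.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# Overlong encodings of "/" (U+002F). Only the 1-byte form b"/" is legal.
OVERLONG_SLASH = [
    b"\xc0\xaf",
    b"\xe0\x80\xaf",
    b"\xf0\x80\x80\xaf",
    b"\xf8\x80\x80\x80\xaf",      # 5-byte form, never valid
    b"\xfc\x80\x80\x80\x80\xaf",  # 6-byte form, never valid
]

for seq in OVERLONG_SLASH:
    print(seq.hex(), is_valid_utf8(seq))

# A genuine 4-byte sequence (U+1F600) is accepted.
print(is_valid_utf8("\U0001F600".encode("utf-8")))
```

The validation-bypass risk is that a filter scanning raw bytes for 0x2F would miss every entry in that list, while a lenient decoder would still turn each one into "/".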

It's not 3 bytes so much as 16 bits, aka the Unicode 1.0 limit, which turns into 3 bytes in UTF-8.
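The 16-bit to 3-byte correspondence falls straight out of the UTF-8 length ranges. A minimal sketch, with `utf8_length` as a made-up helper name, cross-checked against Python's own encoder:

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a given code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if code_point <= 0x7F:
        return 1   # ASCII
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3   # everything that fits in 16 bits stops here
    return 4       # supplementary planes, including most emoji


# The last 16-bit code point takes 3 bytes; the very next one takes 4.
for cp in (0x41, 0x7FF, 0xFFFF, 0x10000, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} bytes")
```

So a tool that caps UTF-8 at 3 bytes per character is, in effect, a tool that only handles the BMP.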
File bugs with those tools. These sorts of issues should have been sorted years ago, and any program that can't handle character encodings longer than 3 bytes should be named and shamed.
And the problem with UTF-16 is that a lot of applications can't handle surrogate pairs, except a lot of Emoji are above the BMP, aren't they? So why is this a bigger deal for UTF-8 than UTF-16?
> except a lot of Emoji are above the BMP, aren't they?

Most of the Unicode 6.0 emoji are.