	by pixelcort 4995 days ago
	The problem with UTF-8 is that lots of tools have 3 byte limits, and characters like Emoji take up 4 bytes in UTF-8.
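
To make the byte counts concrete, here is a minimal Python sketch (not from the thread) that reports how many bytes a few characters occupy in UTF-8. The helper name `utf8_byte_count` and the sample characters are illustrative assumptions.

```python
def utf8_byte_count(ch: str) -> int:
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters chosen for illustration: ASCII, Latin-1, BMP, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_byte_count(ch)} byte(s) in UTF-8")
```

The emoji U+1F600 lies outside the Basic Multilingual Plane, which is why it needs the fourth byte.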

6 comments

potatolicious 4995 days ago

Honest question: isn't that just a broken implementation (and a very obviously broken one at that)? It seems to me there's a big difference between someone not coding to the standard and the standard making your task impossible.

codeka 4995 days ago

The same could be said of UTF-16 implementations that don't support surrogate pairs.

tlrobinson 4995 days ago

But that's not true for UCS-2, which can't represent certain code points.

On the other hand, UTF-8 is ASCII compatible and more efficient for text that's primarily ASCII.

dsl 4995 days ago

If most implementations are broken, it becomes the standard.

pjscott 4995 days ago

How many tools have 3-byte limits on UTF-8? The only one I can think of right now is MySQL. (The workaround is to specify the utf8mb4 character set. This is MySQL's cryptic internal name for "actually doing UTF-8 correctly.")
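
As a rough illustration of when the workaround matters, the Python sketch below flags text that would not fit in a 3-byte-per-character column, on the assumption that such a column can hold only code points up to U+FFFF. The function name `needs_utf8mb4` is hypothetical and not part of any MySQL client library.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if text has a code point above U+FFFF.

    Such code points take 4 bytes in UTF-8, so they exceed a
    3-byte-per-character limit (assumption: the limit is the BMP boundary).
    """
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("plain ASCII text"))
print(needs_utf8mb4("price: 5\u20ac"))
print(needs_utf8mb4("nice \U0001F600"))
```

A check like this could run before an insert to decide whether a value is safe for a legacy column, though switching the column to utf8mb4 is the actual fix the comment describes.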

jrabone 4995 days ago

MySQL is one of the worst offenders for broken Unicode and the collation problems that arise from it. Neither it nor JavaScript deserves consideration for problems that need robust Unicode handling.

mikeash 4995 days ago

I actually switched my (low traffic, low performance needs) blog comments database from MySQL to SQLite purely because I could not make MySQL and Unicode get along. All I needed was for it to accept and then regurgitate UTF-8 and it couldn't even handle that. I'm sure it can be done, but none of the incantations I tried made it work, and it was ultimately easier for me to switch databases.

pjscott 4995 days ago

As an ugly last resort, you could store Unicode as UTF-8 in BLOB fields. MySQL is pretty good about storing binary data. (I dread the day that I'll have to do something more advanced with Unicode in MySQL than just storing it.)

mikeash 4994 days ago

I no longer recall whether I tried that and failed, or didn't get that far. Seems like a semi-reasonable approach if you don't need the database to be able to understand the contents of that column. But on the other hand, SQLite is working great for my needs too.

thristian 4995 days ago

I don't think it's specifically a 3-byte limit, I think it's just that lots of tools decode UTF-8 into UCS-2 internally instead of UTF-16.
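
The difference between the two internal forms can be sketched in a few lines of Python. This is an illustrative sketch only: `to_utf16_units` and `to_ucs2_units` are hypothetical helpers, and neither is taken from any tool mentioned in the thread.

```python
def to_utf16_units(cp: int) -> list[int]:
    """Encode one code point as UTF-16 code units (surrogate pair if needed)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                      # 20 bits remain
    return [0xD800 | (v >> 10),           # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]         # low surrogate: bottom 10 bits


def to_ucs2_units(cp: int) -> list[int]:
    """Encode one code point as UCS-2, which has no surrogate mechanism."""
    if cp > 0xFFFF:
        raise ValueError(f"U+{cp:04X} cannot be represented in UCS-2")
    return to_utf16_units(cp)


print([hex(u) for u in to_utf16_units(0x1F600)])
try:
    to_ucs2_units(0x1F600)
except ValueError as exc:
    print(exc)
```

A tool that decodes into UCS-2 hits the `ValueError` branch for every code point above U+FFFF, which looks from the outside like a 3-byte limit on UTF-8 input.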

jrabone 4995 days ago

That is quite clearly broken, and any tool that does so should be fixed or dumped. This is not new, and Markus Kuhn has made UTF-8 test resources available for years at http://www.cl.cam.ac.uk/~mgk25/unicode.html

mef 4995 days ago

I've found this sub-page super useful over the years for testing http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

thristian 4995 days ago

I believe that MySQL is one such tool. In recent versions you can work around it by asking for the encoding "utf8mb4" instead of "utf8", but I think they have to be quite recent.

So yes, another way in which MySQL is quite clearly broken.

eps 4995 days ago

Which tools?

Honest question, as the three byte limit seems rather arbitrary and no more logical than, say, a four byte one.

jrabone 4995 days ago

It is totally arbitrary - there's no reason you can't have degenerate 6-byte encodings, and compliant decoders should cope with them. See Markus Kuhn's excellent UTF-8 decoder torture test page linked elsewhere in this thread.

mikeash 4994 days ago

Actually, a truly compliant UTF-8 decoder should reject all degenerate forms, because they can be used to bypass validation checks, and the spec does not allow such forms. A four-byte UTF-8 sequence is sufficient to represent any Unicode code point (and then some), so no compliant decoder should accept any sequences of 5 or 6 bytes.

Note that Kuhn's torture test page deliberately includes a lot of invalid sequences in order to make sure that they're gracefully handled. Section 4 is dedicated to degenerate encodings like this.
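
A strict decoder's behavior on such input can be checked with a short Python sketch. It relies on Python 3's built-in UTF-8 codec, which rejects overlong and 5-byte forms; the helper name `is_valid_utf8` and the sample byte strings are illustrative assumptions, not taken from the torture test file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8 under a strict decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


candidates = {
    "plain '/'": b"/",
    "overlong 2-byte '/'": b"\xc0\xaf",
    "overlong 3-byte '/'": b"\xe0\x80\xaf",
    "5-byte sequence": b"\xf8\x88\x80\x80\xaf",
    "4-byte emoji": "\U0001F600".encode("utf-8"),
}

for label, raw in candidates.items():
    print(f"{label}: {'accepted' if is_valid_utf8(raw) else 'rejected'}")
```

The overlong forms of `/` show why rejection matters: a validation check that scans for the byte 0x2F would miss them, yet a lenient decoder would still turn them into a slash.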

masklinn 4995 days ago

It's not 3 bytes so much as 16 bits, aka the Unicode 1.0 limit, which turns into 3 bytes in UTF-8.
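
The mapping from code point size to UTF-8 length can be written out as a small Python sketch. The function `utf8_length` is a hypothetical helper based on the standard UTF-8 ranges; the loop cross-checks it against Python's own encoder at the range boundaries.

```python
def utf8_length(cp: int) -> int:
    """Return how many bytes UTF-8 uses for the given code point."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # up to 7 bits
        return 1
    if cp < 0x800:       # up to 11 bits
        return 2
    if cp < 0x10000:     # up to 16 bits: the whole BMP
        return 3
    if cp <= 0x10FFFF:   # everything above the BMP
        return 4
    raise ValueError(f"{cp:#x} is outside the Unicode range")


# Cross-check the boundaries against Python's built-in encoder.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

The largest 16-bit value, U+FFFF, is the last code point that fits in 3 bytes, so a 16-bit internal representation and a 3-byte UTF-8 limit describe the same boundary.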

Tsagadai 4995 days ago

File bugs with those tools. These sorts of issues should have been sorted out years ago, and any program that can't handle character encodings of more than 3 bytes should be named and shamed.

derleth 4995 days ago

And the problem with UTF-16 is that a lot of applications can't handle surrogate pairs, except a lot of Emoji are above the BMP, aren't they? So why is this a bigger deal for UTF-8 than UTF-16?

masklinn 4995 days ago

> except a lot of Emoji are above the BMP, aren't they?

All of the Unicode 6.0 emoji are.
