Hacker News new | ask | show | jobs
by gotthemwmds 3480 days ago
I ran into this at a consulting job recently...

MySQL claims to support utf8, but in reality, it doesn't. You need utf8mb4 to support certain common Kanji characters.

This company had spent untold thousands (possibly millions) trying to convert gigantic databases (and I don't use the term gigantic loosely...) from utf8 to utf8mb4 because some of their Japan-based clients were using Kanji.

Sounds easy right? Wrong. utf8mb4 comes with some technical "gotchas" (google it) that had delayed the attempt to change to it by almost a year.

Anyway, I found this pretty amusing, and got a huge paycheck to explain to them just how screwed they were.

3 comments

A while back I found a minor bug in Thunderbird, where astral plane code points were considered by the line wrapping algorithm to have a width of 2 rather than 1. So I filed a bug report.

Oh, boy.

Turns out that trying to include astral plane code points in whichever version of Bugzilla that Mozilla uses causes comments to be silently truncated! Because MySQL.

I filed that one in 2010; it got deduped against a bug originally filed in 2007; it is now 2016, and the bug is RESOLVED FIXED, and Mozilla's bugzilla still has the same problem.

https://bugzilla.mozilla.org/show_bug.cgi?id=405011

At least the original Thunderbird bug has been fixed.

MySQL's issues aren't just about character width.

When I worked at Mozilla I was on the MDN (developer.mozilla.org) team, and we had this inexplicable bug: articles can be categorized with tags, and both articles and tags are localizable for all the languages MDN supports. So, for example, English reference articles on CSS properties were tagged "CSS Reference", while French reference articles on CSS properties were tagged "CSS Référence".

And... sometimes an English article's page would show it as having the French ("Référence") tag, and sometimes the French article's page would show it as having the English article's tag.

Turns out, MySQL's case-insensitive UTF-8 collation treated "e" and "é" as the same character. We didn't know about that, and hadn't noticed because the tagging library we used worked around it. Until one day a new version of it didn't, and tags from one language would start showing on another language's articles (if the words were the same, aside from diacritics/accents on certain characters). Which led to this:

https://github.com/mozilla/kuma/blob/00fc05b101658f863f58d7f...

That's a custom MySQL collation, which MDN defines and installs, to work around MySQL's default inability to tell "e" and "é" apart.

> utf8mb4 comes with some technical "gotchas" (google it)

I know InnoDB limits index sizes to 767 bytes, meaning VARCHAR(255) using utf8 can have all 255 characters indexed, but VARCHAR(255) using utf8mb4 can only index 191 characters (floor(767/4) == 191).

After a quick Google search, that seems to be the most common gotcha. What other gotchas did you have in mind?

This was definitely the first thing that came up, as you found.

To be honest, I just don't remember. There was something about something that made something scary to the PM who was in charge of it all? That is about the best I can come up with.

I want to say the needed to index more than 191 chars, but that seems like a stupid thing to say. Who needs to index that many chars?

If I remember, I'll edit :)

edit: I guess I should say I was consulted to do some unrelated things, then helped them with some MySQL stuff that came up towards the end of the contract, then the utf8mb4 stuff came up, and I spent some time going through it with them. It was not the main focus of the contract, which is part of why I don't remember it very well. Just something that came up in the day to day...