by bonsaibilly 1210 days ago
Thankfully MySQL also offers a non-gimped version of UTF-8 that one should always use in preference to the 3-byte version, but yeah it sucks that it's not the "obvious" version of UTF-8.

Is this part of MySQL's policy of "do the thing I've always done, no matter how daft or broken that may be, unless I see an obscure setting telling me to do the new correct thing"?
That'd be my guess, but I don't really know. They just left the "utf8" character set as the broken version of UTF-8 that stops at 3 bytes per character, and added the "utf8mb4" character set and "utf8mb4_unicode_ci" collation for "no, actually, I want UTF-8 for real".
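To make the 3-byte limit concrete, here is a minimal Python sketch. It assumes the only thing that matters is whether every code point sits at or below U+FFFF, which is the most a 3-byte UTF-8 sequence can encode. The helper names `utf8_len` and `fits_in_3_byte_utf8` are made up for illustration and are not part of MySQL or any library.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character takes when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_3_byte_utf8(text: str) -> bool:
    """True if every code point is at or below U+FFFF, i.e. needs at most
    3 bytes in UTF-8. Hypothetical helper, not a MySQL API."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")
    print(fits_in_3_byte_utf8("caf\u00e9 \u20ac5"))   # BMP only
    print(fits_in_3_byte_utf8("hi \U0001F600"))       # contains an emoji
```

Anything that fails the check (emoji, many less common CJK characters) is what the old "utf8" columns can't hold.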
It will be a fun day when Unicode crosses the 5-byte UTF-8 encoding threshold :/
It won't. We settled on using stateful combining characters instead. (Remember when the selling point of switching the world to Unicode was "represent all writing systems with a single stateless 16-bit encoding"? Yeah, well, lol.)
Anything beyond four bytes is composed of multiple code points, happily.
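A quick Python sketch of that point: the highest code point, U+10FFFF, still fits in four bytes, and things that look like one character but take more bytes turn out to be several code points stuck together. The helper name `bytes_per_code_point` and the two sample strings are just picked for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space


def bytes_per_code_point(text: str) -> list[int]:
    """UTF-8 byte length of each code point in text, in order."""
    return [len(ch.encode("utf-8")) for ch in text]


# "e" followed by U+0301 COMBINING ACUTE ACCENT: renders as one accented letter
combining = "e\u0301"

# WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER: renders as one emoji
zwj_sequence = "\U0001F469\u200D\U0001F4BB"

if __name__ == "__main__":
    print(bytes_per_code_point(chr(MAX_CODE_POINT)))
    print(bytes_per_code_point(combining))
    print(bytes_per_code_point(zwj_sequence), len(zwj_sequence.encode("utf-8")))
```

The emoji sequence is 11 bytes in total, but no single code point in it goes past four.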
No, the default these days is the saner utf8mb4 if you create a new database on a modern MySQL version (8.0 or later). But if you have an old database using the old encoding, upgrading the server doesn't magically update the encoding, because some people take backwards compatibility seriously.