Hacker News new | ask | show | jobs
by thristian 4953 days ago
I don't think it's specifically a 3-byte limit, I think it's just that lots of tools decode UTF-8 into UCS-2 internally instead of UTF-16.
1 comments

That is quite clearly broken, and any tool that does so should be fixed or dumped. This is not new, and Marcus Kuhn had made UTF8 test resources available for years at http://www.cl.cam.ac.uk/~mgk25/unicode.html
I've found this sub-page super useful over the years for testing http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
I believe that MySQL is one such tool. In recent versions you can work around it by asking for the encoding "utf8mb4" instead of "utf8", but I think they have to be quite recent.

So yes, another way in which MySQL is quite clearly broken.