So now we know who is really responsible for the whole MySQL utf8mb4 fiasco: these two guys sitting in a diner, conjuring up a brilliant scheme to cover about 2 billion code points (2^31), which turned out to exceed the actual requirement (Unicode tops out at 1,114,112 code points) by nearly 2000x.

September 1992: Two guys scribbling on a placemat.
January 1998: RFC 2279 defines UTF-8 as 1 to 6 bytes per character.
March 2001: A bunch of CJK characters were added in Unicode 3.1.0, pushing the total to 94,140 and past the 16-bit limit of 3-byte UTF-8.
March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)
September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)
November 2003: RFC 3629 defines UTF-8 as 1 to 4 bytes per character.

Arguably, if the placemat had been smaller and the guys had stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still have happened.

EDIT: Just noticed this in the footnotes, and the plot thickens...

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.

So UTF-8 was indeed intended to be utf8mb3!
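
For anyone who wants to check the arithmetic, here is a rough Python sketch. The helper names (`max_code_point`, `utf8_len`) are mine, not from any RFC or from MySQL; the payload bit counts are those of the original 1 to 6 byte scheme. It shows why 3 bytes stops at U+FFFF and how far the 6-byte design overshoots.

```python
# Payload bits carried by an n-byte sequence in the original 1-6 byte scheme.
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21, 5: 26, 6: 31}

# Code points allowed today (U+0000 through U+10FFFF), per RFC 3629.
UNICODE_CODE_POINTS = 0x10FFFF + 1


def max_code_point(n_bytes: int) -> int:
    """Largest value an n-byte sequence can hold (hypothetical helper)."""
    return (1 << PAYLOAD_BITS[n_bytes]) - 1


def utf8_len(ch: str) -> int:
    """Number of bytes Python's UTF-8 encoder uses for one character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} byte(s): up to U+{max_code_point(n):X}")
    # How many times the 6-byte space exceeds what Unicode actually allows.
    print((max_code_point(6) + 1) // UNICODE_CODE_POINTS)
    # A BMP character fits in 3 bytes; an emoji outside the BMP needs 4.
    print(utf8_len("\u4e2d"), utf8_len("\U0001F600"))
```

Anything that needs 4 bytes here is exactly what a 3-byte limit cannot store.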