We define 7 byte types:

    T0   0xxxxxxx   7 free bits
    Tx   10xxxxxx   6 free bits
    T1   110xxxxx   5 free bits
    T2   1110xxxx   4 free bits
    T3   11110xxx   3 free bits
    T4   111110xx   2 free bits
    T5   111111xx   2 free bits

Encoding is as follows.

    From hex   Thru hex   Sequence            Bits
    00000000   0000007f   T0                  7
    00000080   000007FF   T1 Tx               11
    00000800   0000FFFF   T2 Tx Tx            16
    00010000   001FFFFF   T3 Tx Tx Tx         21
    00200000   03FFFFFF   T4 Tx Tx Tx Tx      26
    04000000   FFFFFFFF   T5 Tx Tx Tx Tx Tx   32
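For anyone who wants to poke at the table, here is a rough Python sketch of the encoder it implies. The names `encode_placemat` and `ROWS` are mine, not from the original proposal, and the code follows the table literally, including the 32-bit T5 row, so it is not the RFC 3629 encoder. It should agree with modern UTF-8 for values that fit in 4 bytes or fewer.

```python
# Sketch of the 1992 table. Each row: (thru value, lead-byte prefix, number of Tx bytes).
# ROWS and encode_placemat are hypothetical names chosen for illustration.
ROWS = [
    (0x0000007F, 0x00, 0),  # T0                  7 bits
    (0x000007FF, 0xC0, 1),  # T1 Tx              11 bits
    (0x0000FFFF, 0xE0, 2),  # T2 Tx Tx           16 bits
    (0x001FFFFF, 0xF0, 3),  # T3 Tx Tx Tx        21 bits
    (0x03FFFFFF, 0xF8, 4),  # T4 Tx Tx Tx Tx     26 bits
    (0xFFFFFFFF, 0xFC, 5),  # T5 Tx Tx Tx Tx Tx  32 bits
]


def encode_placemat(value: int) -> bytes:
    """Encode a 32-bit value using the byte types from the table."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    # Pick the first row whose range covers the value.
    for thru, lead, n_cont in ROWS:
        if value <= thru:
            break
    # The lead byte carries the highest bits; each Tx byte carries 6 more.
    out = [lead | (value >> (6 * n_cont))]
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((value >> (6 * i)) & 0x3F))
    return bytes(out)
```

For example, `encode_placemat(0x20AC)` should give the same three bytes as `"€".encode("utf-8")`, while `encode_placemat(0xFFFFFFFF)` needs the full 6 bytes.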

So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover 4 billion characters, which turned out to exceed the actual requirement by more than 2000x.

September 1992: 2 guys scribbling on a placemat.

January 1998: RFC 2279 defines UTF-8 to be between 1 and 6 bytes.

March 2001: A bunch of CJK characters were added in Unicode 3.1.0, pushing the total to 94,140 and exceeding the 16-bit limit of 3-byte UTF-8.

March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)

September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency reasons (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)

November 2003: RFC 3629 defines UTF-8 to be between 1 and 4 bytes.
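To make the 3-byte limit in the timeline concrete, here is a small Python check, only a sketch. The helper name `utf8_len` is mine. It uses U+20000, the first code point of CJK Extension B (part of the Unicode 3.1 additions), as an example of a character that a 3-byte cap cannot hold.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes Python's UTF-8 codec uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# Highest value that fits in 16 bits, i.e. in a 3-byte sequence.
MAX_3_BYTE = 0xFFFF

# Illustrative sample: U+20000, first code point of CJK Extension B.
CJK_EXT_B_FIRST = 0x20000

print(utf8_len(MAX_3_BYTE))       # 3
print(utf8_len(CJK_EXT_B_FIRST))  # 4
```

Anything above U+FFFF needs that fourth byte, which is exactly what a 3-byte limit leaves out.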

Arguably, if the placemat had been smaller and the guys had stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still have happened.

EDIT: Just noticed this in the footnotes, and the plot thickens...

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.

So UTF-8 was indeed intended to be utf8mb3!