by stormbrew
4665 days ago
That's a reasonable replacement for UCS-4 as an internal representation, but it's not actually a character encoding the way UTF-8 and UTF-16 are. It's just a tagged union of several encodings. As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data contain large numbers of code points that fit in one byte in UTF-8 and take two bytes in UTF-16, so it tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
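To put rough numbers on the "worst case vs. evens out" point, here is a minimal Python sketch. The helper `encoded_sizes` and both sample strings are made up for illustration, and the samples are artificially repetitive, so the gzip figures flatter gzip far more than real text would.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Return byte counts for a string in UTF-8 and UTF-16, raw and gzipped."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no BOM
    return {
        "utf-8": len(utf8),
        "utf-16": len(utf16),
        "utf-8+gzip": len(gzip.compress(utf8)),
        "utf-16+gzip": len(gzip.compress(utf16)),
    }


# Hypothetical samples: pure CJK text vs. CJK wrapped in ASCII markup.
cjk_only = "漢字" * 200
cjk_in_markup = '<p class="note">漢字</p>' * 200

for name, sample in [("cjk_only", cjk_only), ("cjk_in_markup", cjk_in_markup)]:
    print(name, encoded_sizes(sample))
```

Pure BMP CJK text is the 50% case (3 bytes per character in UTF-8 against 2 in UTF-16). Once ASCII markup is mixed in, UTF-8 comes out smaller, and gzip shrinks both well below either raw size on this sample.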
For Latin alphabets, yes. For CJK, it's really bad: most CJK characters take 3 bytes in UTF-8 versus 2 in UTF-16. Things get worse once you have to deal with non-BMP characters, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (Why the hell do people even presume utf8 is max 3 bytes?)
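For anyone who wants to check the byte counts, a small Python sketch follows. The helper names `utf8_len`, `utf16_len` and `fits_in_three_bytes` are invented for this example; the last one just tests whether every code point is in the BMP, which is the limit a 3-byte-per-character column (such as MySQL's legacy utf8, as opposed to utf8mb4) can hold.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_len(ch: str) -> int:
    """Number of bytes one character takes in UTF-16 (no BOM)."""
    return len(ch.encode("utf-16-le"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every code point is in the BMP (<= U+FFFF),
    i.e. needs at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# ASCII letter, accented Latin letter, CJK ideograph, emoji (U+1F600).
for ch in ["A", "\u00e9", "\u6f22", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: utf-8={utf8_len(ch)} bytes, utf-16={utf16_len(ch)} bytes")

print(fits_in_three_bytes("\u6f22\u5b57"))      # BMP only
print(fits_in_three_bytes("hi \U0001F600"))     # contains a non-BMP emoji
```

The emoji is the only sample that needs 4 bytes in UTF-8 (and a surrogate pair, also 4 bytes, in UTF-16), which is exactly the case a 3-byte limit silently excludes.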