|
|
|
|
|
by est
4669 days ago
|
|
> 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8 For latin alphabets, yes. For CJK, it's really bad. Things get worse if you dealt with non-BMP before, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (why the hell do people even presume utf8 is max 3 bytes?) |
|
One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.
Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.