

	by est 4669 days ago
	> 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8

	For Latin alphabets, yes. For CJK, it's really bad. Things get worse if you have dealt with non-BMP characters before, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (Why the hell do people even presume utf8 is max 3 bytes?)

3 comments

rspeer 4669 days ago

Because people either don't know anything outside of the BMP exists, or they think astral characters are only for dead languages (they haven't had the dawning realization about emoji yet), or they use a programming language like Java that accidentally implemented CESU-8 and called it "UTF8" a decade and a half ago and isn't allowed to fix it.
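To make the 3-byte vs. 4-byte point concrete, here is a rough Python sketch. The helper `cesu8_encode` is a hypothetical name written for this illustration (it is not a standard library function); it splits astral characters into a UTF-16 surrogate pair and encodes each half separately, which is what CESU-8 does.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 encoder (hypothetical helper, not a stdlib API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Astral character: split into a UTF-16 surrogate pair first.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8 (1 to 3 bytes).
            out += ch.encode("utf-8")
    return bytes(out)


emoji = "\U0001F600"  # an astral (non-BMP) emoji
utf8 = emoji.encode("utf-8")
cesu8 = cesu8_encode(emoji)

print(len(utf8), utf8)    # real UTF-8 needs 4 bytes here
print(len(cesu8), cesu8)  # CESU-8 uses 3 + 3 = 6 bytes
```

Anything that assumes "UTF-8 is at most 3 bytes" can hold the CESU-8 form but not the real 4-byte UTF-8 form.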

One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.

Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.


kps 4669 days ago

    > ... MySQL ...
    > why the hell do people even presume utf8 is max 3 bytes?

I think you answered your own question before you even asked.


rspeer 4669 days ago

Except most text isn't plain text. HTML pages in CJK are still smaller in UTF-8 than in their respective countries' favorite encodings.


thaumasiotes 4669 days ago

Well... China's favorite encoding is GB, which encodes ASCII values as one byte and Chinese characters as two bytes. It's hard to see how UTF-8 (one byte for ASCII, three for Chinese characters) would beat that, on the assumption that nearly 100% of what a Chinese website would want to transmit is either ASCII (where UTF-8 is equivalent to GB) or Chinese (where it's inferior).

How do I know GB is preferred? I'm going off of three things:

- According to Wikipedia (http://en.wikipedia.org/wiki/GB18030), software sold in China is legally required to support it.

- I was once given a Chinese ebook, which I had to figure out was in GB before I could read it. (And now, I know about chardet!)

- I worked with a Chinese programmer who accidentally committed files in GB, even though they were supposed to be in UTF-8.

And since the latest GB can in fact represent any Unicode code point, it's hard to see why it wouldn't be preferred indefinitely.
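The byte counts above are easy to check. This is a small Python sketch using the standard `utf-8` and `gb18030` codecs; the sample strings are made up for illustration and are not from any real page.

```python
# Sample strings chosen for illustration only.
samples = {
    "ascii": "hello",
    "chinese": "中文编码",      # four common simplified characters
    "emoji": "\U0001F600",     # a non-BMP code point
}

# Byte length of each sample under each encoding.
sizes = {
    name: {enc: len(text.encode(enc)) for enc in ("utf-8", "gb18030")}
    for name, text in samples.items()
}

for name, by_encoding in sizes.items():
    print(name, by_encoding)
```

ASCII comes out the same size in both, the Chinese sample is 12 bytes in UTF-8 versus 8 in GB18030, and the emoji encodes successfully in both at 4 bytes each.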


rspeer 4665 days ago

Okay, you are right. The comparison I was thinking of was actually about UTF-16, and of course that's not actually preferred to GB or Shift-JIS or whatever.


lifthrasiir 4669 days ago

You are tremendously wrong. Almost every legacy CJK encoding encodes a string in fewer bytes than UTF-8 does, as long as the string contains no characters the encoding doesn't support. I have seen lots of (mostly misguided) people who prefer those legacy encodings over UTF-8/16 solely for this reason.


est 4669 days ago

> still smaller in UTF-8 than in their respective countries' favorite encodings

How so? AFAIK Shift-JIS is ASCII-compatible just like UTF-8, and so are other double-byte encodings like Big5 and GBK.
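A quick Python check of that claim, as a sketch using the standard `shift_jis`, `big5`, and `gbk` codecs. The sample text is arbitrary, and it deliberately sticks to letters and basic punctuation. The last lines show one caveat: in Shift-JIS a trailing byte can fall in the ASCII range, which never happens in UTF-8.

```python
ascii_text = "Hello, world"
legacy_encodings = ("shift_jis", "big5", "gbk")

# Plain ASCII letters and punctuation encode to the same single bytes everywhere.
ascii_bytes = ascii_text.encode("ascii")
same_as_ascii = {
    enc: ascii_text.encode(enc) == ascii_bytes
    for enc in legacy_encodings + ("utf-8",)
}
print(same_as_ascii)

# Caveat: the second byte of this katakana in Shift-JIS is 0x5C,
# the same value as an ASCII backslash.
so_sjis = "ソ".encode("shift_jis")
so_utf8 = "ソ".encode("utf-8")
print(so_sjis, so_utf8)
```

So ASCII text round-trips identically, but byte-oriented tools that scan for ASCII characters can misfire on the legacy encodings in a way they cannot on UTF-8.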
