HN Mirror


	by rspeer 4707 days ago
	Except most text isn't plain text. HTML pages in CJK are still smaller in UTF-8 than in their respective countries' favorite encodings.

3 comments

thaumasiotes 4707 days ago

Well... China's favorite encoding is GB, which encodes ASCII values as one byte and Chinese characters as two bytes. It's hard to see how UTF-8 (one byte for ASCII, three for Chinese characters) would beat that, assuming that nearly 100% of what a Chinese website would want to transmit is either ASCII (where UTF-8 is equivalent to GB) or Chinese (where it's inferior).
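To make the byte counts concrete, here is a minimal sketch using Python's built-in codecs. The helper name `encoded_sizes` and the sample strings are made up for illustration, and "gb18030" stands in for "GB" here; common Chinese characters take two bytes in it.

```python
def encoded_sizes(text, encodings=("utf-8", "gb18030")):
    """Return the encoded length in bytes of `text` for each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Illustrative samples (not from the thread): pure ASCII and pure Chinese.
ascii_sample = "hello"
chinese_sample = "中文网站"

# ASCII costs one byte per character in both encodings.
print(encoded_sizes(ascii_sample))

# Each of these Chinese characters costs three bytes in UTF-8, two in GB18030.
print(encoded_sizes(chinese_sample))
```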

How do I know GB is preferred? I'm going off of three things:

- According to Wikipedia (http://en.wikipedia.org/wiki/GB18030), software sold in China is legally required to support it.

- I was once given a Chinese ebook, and I had to figure out that it was in GB before I could read it. (And now I know about chardet!)

- I worked with a Chinese programmer who accidentally committed files in GB, even though they were supposed to be in UTF-8.

And since the latest GB can in fact represent any Unicode code point, it's hard to see why it wouldn't be preferred indefinitely.
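A quick way to check the "any code point" claim is to round-trip a few characters through Python's built-in "gb18030" codec. This is only a sketch: the helper `round_trips` and the sample characters are illustrative, and a few samples demonstrate the idea without proving full coverage.

```python
def round_trips(text, encoding="gb18030"):
    """True if encoding and then decoding `text` gives back the same string."""
    return text.encode(encoding).decode(encoding) == text


# Illustrative samples: a common Chinese character, an accented Latin
# letter, and an emoji from outside the Basic Multilingual Plane.
samples = ["中", "é", "😀"]

for s in samples:
    encoded = s.encode("gb18030")
    # GB18030 is variable-width: one, two, or four bytes per character.
    print(repr(s), len(encoded), round_trips(s))
```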

rspeer 4703 days ago

Okay, you are right. The comparison I was thinking of was actually about UTF-16, and of course that's not actually preferred to GB or Shift-JIS or whatever.
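For what it's worth, the UTF-16 version of the comparison is easy to reproduce on a markup-heavy snippet. The HTML string below is invented for illustration, and "utf-16-le" is used so that no byte order mark is counted; treat it as a sketch, not a measurement of real pages.

```python
# A made-up, markup-heavy page fragment: mostly ASCII tags, a little Chinese.
html = (
    '<html><head><title>示例</title></head>'
    '<body><p class="intro">你好，世界</p></body></html>'
)


def size_in(text, encoding):
    """Encoded length of `text` in bytes."""
    return len(text.encode(encoding))


sizes = {enc: size_in(html, enc) for enc in ("gb18030", "utf-8", "utf-16-le")}

# ASCII markup costs two bytes per character in UTF-16 but one in UTF-8
# and GB18030, so UTF-16 loses here despite its two-byte Chinese characters.
for enc, n in sizes.items():
    print(enc, n)
```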

lifthrasiir 4707 days ago

You are tremendously wrong. Almost every legacy CJK encoding encodes a string in fewer bytes than UTF-8 does, as long as the string contains no characters the encoding doesn't support. I have seen lots of (mostly misguided) people who prefer those legacy encodings over UTF-8/16 solely for this reason.

est 4707 days ago

> still smaller in UTF-8 than in their respective countries' favorite encodings

How so? AFAIK Shift-JIS is ASCII-compatible just like UTF-8, and so are other double-byte encodings like Big5 and GBK.
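A small sketch of what "ASCII-compatible" means in practice, using Python's built-in codecs (the sample string is arbitrary). One caveat: in the underlying JIS X 0201 standard, bytes 0x5C and 0x7E are the yen sign and overline rather than backslash and tilde, so the sample sticks to letters and basic punctuation.

```python
# Arbitrary ASCII-only sample, avoiding backslash and tilde on purpose.
sample = "Hello, world"

encodings = ("utf-8", "shift_jis", "big5", "gbk")

# For plain ASCII text, each encoding should produce the same bytes as ASCII.
results = {enc: sample.encode(enc) for enc in encodings}

for enc, data in results.items():
    print(enc, data == sample.encode("ascii"))
```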
