| HN Mirror



	by aerolite 4681 days ago
	More ethnocentrism. UTF-16 results in less bandwidth for any language that has non-ASCII characters.


stormbrew 4680 days ago

I think the Unicode group's handling of the round trip issues faced by Japanese and Chinese users with their own encodings due to Han Unification is a bigger ethnocentric problem by far. I'm not unaware of these issues.

But it's not like UTF-16 was invented to solve that problem. It was more an accident of history: in the early days of Unicode, 2 bytes were enough to encode every code point, and a 2-byte encoding was appealing as a universal solution in a way that a 3- or 4-byte encoding never would be. Some group of people on the planet will have to suffer encodings of 3 or more bytes no matter what. And I'd be happy to make it mine if it meant we could go with the better technical solution, if I had the power to do that.

All that aside, it's certainly not quite as clear-cut as "any language". Many languages have a mix of ASCII and non-ASCII characters, and those fare better in UTF-8 than they would in UTF-16. And in the code points up to 0x07ff (above which UTF-8 goes up to 3 bytes) there are character sets like Greek, Hebrew, Arabic, and Cyrillic, which take 2 bytes per character in either encoding. And as has been pointed out, ASCII characters are often used in structural elements of documents, which makes it unlikely that the general case is as bad as the worst case (50% inflation, 3 bytes instead of 2) for most practical text. Particularly when you throw compression into the mix.
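A quick way to check these byte counts is to encode a few short strings both ways. This is only a sketch: the helper name `encoded_sizes` and the sample strings are made up for illustration, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only, not a representative corpus.
samples = {
    "greek": "αβγδ",
    "cyrillic": "привет",
    "hebrew": "שלום",
    "japanese": "日本語",
    "japanese in markup": "<p>日本語</p>",
}

for name, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{name}: utf-8={utf8_size} bytes, utf-16={utf16_size} bytes")
```

The bare Japanese sample comes out at 9 bytes in UTF-8 against 6 in UTF-16 (the 50% worst case), while the same text wrapped in a `<p>` tag is 16 bytes against 20, because the ASCII markup costs half as much in UTF-8.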

[edit] fixed some minor things.


pedrocr 4681 days ago

That's not true. For anything that uses the Latin alphabet, UTF-8 will be strictly better, as the majority of characters will be in ASCII (1 byte) and the rest will just need an extra byte, equaling UTF-16.

Looking at the format definition[1], anything below U+0800 will fit in 2 bytes in UTF-8. So the ethnocentrism starts at Samaritan[2] (U+0800), where UTF-8 needs 3 bytes against UTF-16's 2. The Japanese, Chinese and Indian scripts are probably the most important in that range.

[1] https://en.wikipedia.org/wiki/UTF-8#Description

[2] http://codepoints.net/basic_multilingual_plane
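The boundaries in the format definition can be written down directly. A minimal sketch, assuming a hypothetical helper named `utf8_len` that takes a code point and returns how many bytes UTF-8 uses for it:

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses for one code point (surrogates excluded)."""
    if code_point < 0x80:
        return 1  # ASCII
    if code_point < 0x800:
        return 2  # Latin supplements, Greek, Cyrillic, Hebrew, Arabic, ...
    if code_point < 0x10000:
        return 3  # rest of the Basic Multilingual Plane, starting at Samaritan
    return 4      # supplementary planes


# Cross-check the table against Python's own encoder at each boundary.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
```

UTF-16 uses 2 bytes for everything below U+10000 and 4 bytes above it, so the only range where UTF-8 is larger is U+0800 through U+FFFF.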


gsnedders 4681 days ago

I can't find the research hsivonen (probably best known for validator.nu and Gecko's HTML parser) did into this a while ago, but he concluded that for the top 100 (IIRC) CJK sites, the pages were smaller as gzip'd UTF-8 than as gzip'd UTF-16, because they all contained enough HTML/CSS/JS that the savings from its shorter UTF-8 representation outweighed the savings UTF-16 gives on the text itself.
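This kind of comparison is easy to rerun on any page you have on hand. The sketch below is not that research and uses no real site data: `gzipped_sizes` is a hypothetical helper, and `page` is a synthetic stand-in for a markup-heavy CJK page, so the numbers it prints say nothing about real sites.

```python
import gzip


def gzipped_sizes(text):
    """Return (gzip'd UTF-8 size, gzip'd UTF-16 size) in bytes for text."""
    utf8 = gzip.compress(text.encode("utf-8"))
    utf16 = gzip.compress(text.encode("utf-16-le"))  # little-endian, no BOM
    return len(utf8), len(utf16)


# Synthetic stand-in for a markup-heavy CJK page (an assumption, not real data).
page = "".join(
    f'<div class="item" id="n{i}"><a href="/item/{i}">日本語のテキスト {i}</a></div>\n'
    for i in range(200)
)

gz_utf8, gz_utf16 = gzipped_sizes(page)
print(f"raw:  utf-8={len(page.encode('utf-8'))}, utf-16={len(page.encode('utf-16-le'))}")
print(f"gzip: utf-8={gz_utf8}, utf-16={gz_utf16}")
```

To test a real page, read its decoded text into `page` instead of generating it.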
