by pedrocr 4681 days ago
That's not true. For anything that uses the Latin alphabet, UTF-8 will be at least as compact as UTF-16: the majority of characters will be ASCII (1 byte), and almost all of the rest need just one extra byte, which equals UTF-16.

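Looking at the format definition[1], anything below U+0800 fits in at most 2 bytes in UTF-8, so it is never larger than in UTF-16. The ethnocentrism therefore starts at Samaritan (U+0800)[2], the first block that needs 3 bytes in UTF-8 against 2 in UTF-16. The Japanese, Chinese and Indian scripts are probably the most important ones in that 3-byte range.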

[1] https://en.wikipedia.org/wiki/UTF-8#Description

[2] http://codepoints.net/basic_multilingual_plane
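
To make the boundary concrete, here is a minimal Python sketch that prints the encoded size of a few sample characters on either side of U+0800. The helper name `encoded_sizes` and the choice of sample characters are mine, purely for illustration; `utf-16-le` is used so that no byte order mark gets counted.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" would prepend.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Illustrative sample only: a few code points on each side of U+0800.
samples = {
    "ASCII letter a": "\u0061",
    "Latin e with acute": "\u00e9",
    "last code point below U+0800": "\u07ff",
    "Samaritan letter alaf": "\u0800",
    "Devanagari letter a": "\u0905",
    "Hiragana letter a": "\u3042",
    "CJK ideograph": "\u4e2d",
}

for name, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {name}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Everything up to U+07FF comes out at 1 or 2 bytes in UTF-8; from U+0800 onwards it is 3 bytes against 2 for UTF-16.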

1 comment

I can't find the research hsivonen (probably best known for validator.nu and Gecko's HTML parser) did into this a while ago, but he concluded that the top 100 (IIRC) CJK sites were smaller as gzip'd UTF-8 than as gzip'd UTF-16. They all contained enough HTML/CSS/JS that the savings from encoding that markup in 1 byte per character outweighed the extra byte UTF-8 spends on each character of the text itself.
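
For anyone who wants to try this on their own pages, a rough sketch of the comparison in Python follows. This is not hsivonen's methodology, which I can't reproduce from memory; `compare_sizes` is a hypothetical helper and the sample page is made up, so the numbers only illustrate the mechanism and say nothing about real sites.

```python
import gzip


def compare_sizes(text: str) -> dict[str, int]:
    """Return raw and gzip'd byte counts of text in UTF-8 and UTF-16."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no BOM
    return {
        "utf8_raw": len(utf8),
        "utf16_raw": len(utf16),
        "utf8_gzip": len(gzip.compress(utf8)),
        "utf16_gzip": len(gzip.compress(utf16)),
    }


# Made-up page: ASCII markup wrapped around short runs of Japanese text.
# Real pages repeat far less, so treat the output as illustrative only.
row = '<div class="post"><a href="/item?id={}">日本語のテキスト</a></div>\n'
page = "".join(row.format(i) for i in range(50))

for key, size in compare_sizes(page).items():
    print(f"{key}: {size} bytes")
```

Feeding it the saved HTML of an actual CJK site instead of the synthetic `page` would be the more meaningful test.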