by aew4ytasghe5
1292 days ago
UTF-8 is only more compact for what's on an American keyboard (7-bit ASCII): the unaccented Latin letters, digits, and punctuation map to 1-byte encodings. This is why UTF-8 wasn't universally hailed to begin with. Countries like Japan had already solved encoding differently while the US was still using ASCII (see the JIS encodings). As time passed, though, the byte savings came to matter less, and transfer compression makes the remaining size differences negligible.
In practice, basic 7-bit ASCII characters dominate a lot of texts, even those in East Asian languages, as all the markup and such around the actual text is almost always in plain ASCII. For example, if I request a Japanese document from an HTTP server, then the HTTP headers, HTML tags, and CSS are all in single-byte characters.
Of course, there are still many scenarios where UTF-16 "wins" in terms of size – especially with plain text – but real-world size comparisons tend to be a bit more tricky.
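To make the trade-off concrete, here is a minimal Python sketch, not the script mentioned in this comment. The sample strings and the helper names `encoded_sizes` and `count_non_ascii` are made up for illustration. It encodes the same Japanese text once bare and once wrapped in a bit of HTML, and compares the byte counts.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" prepends.
        "utf-16": len(text.encode("utf-16-le")),
    }


def count_non_ascii(text: str) -> int:
    """Count characters that need more than one byte in UTF-8."""
    return sum(1 for ch in text if ord(ch) > 0x7F)


# Hypothetical samples: the same Japanese text, bare and inside markup.
plain = "日本語のテキスト"
markup = '<p class="intro"><a href="/wiki/Main">' + plain + "</a></p>"

for label, sample in (("plain", plain), ("markup", markup)):
    sizes = encoded_sizes(sample)
    print(label, sizes, "non-ASCII chars:", count_non_ascii(sample))
```

For the bare text UTF-16 comes out smaller, since each of these characters takes 3 bytes in UTF-8 and 2 in UTF-16. Once the ASCII markup is added, UTF-8 is smaller, because every tag character costs 2 bytes in UTF-16 but only 1 in UTF-8.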
For example, for the current Wikipedia homepage of ja, zh, and ko:
The UTF-16 versions are all larger (and that's just the HTML, excluding the CSS, JS, etc.). I wrote a small script to count the number of "wide" multibyte characters:
It was higher than I expected, although not that surprising once you look at the source and see things like: