|
> UTF8 is only superior in encoding the american keyboard (7 bits of ascii). All the latin letters map to 1 byte encodings.

In practice, basic 7-bit ASCII characters dominate a lot of texts, even those in East Asian languages, as all the markup around the actual text is almost always plain ASCII. For example, if I request a Japanese document from an HTTP server, then the HTTP headers, HTML tags, and CSS are all in single-byte characters.

Of course, there are still many scenarios where UTF-16 "wins" in terms of size – especially with plain text – but real-world size comparisons tend to be a bit more tricky. For example, for the current Wikipedia homepages of ja, zh, and ko:

% file *
ja-16.htm: HTML document, Unicode text, UTF-16, little-endian text, with very long lines (2709)
ja-8.htm: HTML document, Unicode text, UTF-8 text, with very long lines (2709)
ko-16.htm: HTML document, Unicode text, UTF-16, little-endian text, with very long lines (2503)
ko-8.htm: HTML document, Unicode text, UTF-8 text, with very long lines (2656)
zh-16.htm: HTML document, Unicode text, UTF-16, little-endian text, with very long lines (5589)
zh-8.htm: HTML document, Unicode text, UTF-8 text, with very long lines (5589)
% ls -lh
210K ja-16.htm
118K ja-8.htm
193K ko-16.htm
106K ko-8.htm
189K zh-16.htm
104K zh-8.htm
The UTF-16 versions are all larger (and that's just the HTML, excluding the CSS, JS, etc.).

I wrote a small script to count the number of "wide" multibyte characters:

ja-8.htm 100,808 7-bit characters; 19,026 wide characters
ko-8.htm 93,888 7-bit characters; 14,149 wide characters
zh-8.htm 91,589 7-bit characters; 14,360 wide characters
The share of 7-bit characters was higher than I expected, although it's not that surprising when you look at the source and see things like:

<div class="vector-page-toolbar">
<div class="vector-page-toolbar-container">
<div id="left-navigation">
<nav aria-label="名前空間">
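
The counting script mentioned above isn't shown, so here is a minimal sketch of how such a count could be done, not the original script. The function names `count_chars` and `encoded_sizes` are made up for illustration, and it assumes the input files are UTF-8. It counts code points, so its numbers need not match the figures above if those were byte counts.

```python
import sys


def count_chars(text: str) -> tuple[int, int]:
    """Return (seven_bit, wide): code points below / at or above U+0080."""
    seven_bit = sum(1 for ch in text if ord(ch) < 0x80)
    return seven_bit, len(text) - seven_bit


def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes); UTF-16 is measured without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    # Assumption: every path given on the command line is a UTF-8 file.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", newline="") as f:
            text = f.read()
        seven_bit, wide = count_chars(text)
        utf8_size, utf16_size = encoded_sizes(text)
        print(f"{path} {seven_bit:,} 7-bit characters; {wide:,} wide characters; "
              f"{utf8_size:,} bytes as UTF-8; {utf16_size:,} bytes as UTF-16")
```

Run on the last line of the snippet above, `<nav aria-label="名前空間">`, this gives 19 7-bit characters and 4 wide ones: 31 bytes as UTF-8 versus 46 as UTF-16, even though the only "real" text on that line is Japanese.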
|