

by Beltalowda 1298 days ago

> UTF8 is only superior in encoding the american keyboard (7 bits of ascii). All the latin letters map to 1 byte encodings.

In practice, basic 7-bit ASCII characters dominate a lot of texts, even those in East-Asian languages, as all the markup and such around the actual text is almost always in plain ASCII. For example, if I request a Japanese document from an HTTP server then the HTTP headers, HTML tags, and CSS are all in single-byte characters.

Of course, there are still many scenarios where UTF-16 "wins" in terms of size – especially with plain text – but real-world size comparisons tend to be a bit more tricky.

For example, for the current Wikipedia homepage of ja, zh, and ko:

    % file *
    ja-16.htm: HTML document, Unicode text, UTF-16, little-endian text, with very long lines (2709)
    ja-8.htm:  HTML document, Unicode text, UTF-8 text, with very long lines (2709)
    ko-16.htm: HTML document, Unicode text, UTF-16, little-endian text, with very long lines (2503)
    ko-8.htm:  HTML document, Unicode text, UTF-8 text, with very long lines (2656)
    zh-16.htm: HTML document, Unicode text, UTF-16, little-endian text, with very long lines (5589)
    zh-8.htm:  HTML document, Unicode text, UTF-8 text, with very long lines (5589)

    % ls -lh
    210K  ja-16.htm
    118K  ja-8.htm
    193K  ko-16.htm
    106K  ko-8.htm
    189K  zh-16.htm
    104K  zh-8.htm

The UTF-16 versions are all larger (and that's just the HTML, excluding the CSS, JS, etc).

I wrote a small script to count the number of "wide" multibyte characters:

    ja-8.htm  100,808 7-bit characters;  19,026 wide characters
    ko-8.htm   93,888 7-bit characters;  14,149 wide characters
    zh-8.htm   91,589 7-bit characters;  14,360 wide characters
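
The script itself isn't shown in the comment. A minimal Python sketch of this kind of count might look like the following, assuming the input is valid UTF-8. The name `count_chars` is made up for illustration. It counts code points, so its numbers may not match the figures above if the original script counted something else (bytes, for instance).

```python
def count_chars(data: bytes) -> tuple[int, int]:
    """Return (ascii_count, wide_count) for UTF-8 encoded input.

    "Wide" here means any code point outside 7-bit ASCII, i.e. anything
    that needs more than one byte in UTF-8.
    """
    text = data.decode("utf-8")
    ascii_count = sum(1 for ch in text if ord(ch) < 0x80)
    return ascii_count, len(text) - ascii_count


# Example on a short in-memory string rather than a file.
narrow, wide = count_chars("abc回回".encode("utf-8"))
print(f"{narrow:,} 7-bit characters; {wide:,} wide characters")
```

To run it on a saved page, read the file in binary mode and pass the bytes to `count_chars`.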

The share of 7-bit characters was higher than I expected, although it's not that surprising when you look at the source and see things like:

    <div class="vector-page-toolbar">
        <div class="vector-page-toolbar-container">
            <div id="left-navigation">
                <nav aria-label="名前空間">


mook 1298 days ago

Is there any particular reason to think HTML would be high on the list of things they thought about when they decided this? It's not like the world wide web had taken over everything yet…


Beltalowda 1298 days ago

I don't know the exact considerations at the time – I wasn't there, and at "only" 38 I'm too young to have lived through it. But the argument holds for many different formats and protocols; I just used HTML as it's common today and I could easily find content for it.

For example, plain text emails still have considerable ASCII data in the headers, no matter which language you use to send them, and depending on the length of your email UTF-8 may be smaller than UTF-16. I just checked a very simple email in my inbox, and it has 165 lines of headers for 21 lines of actual email body text (plain email, no MIME). Even after removing the more modern headers (DKIM, X-Spam-, etc.) we're still left with 35 lines, or 1,886 bytes. The actual email text is 765 bytes, in basic English. If the email text were 765 CJK characters instead, the whole message would still be slightly over 1K smaller in UTF-8: 1,886 + 765×3 = 4,181 bytes vs. (1,886 + 765)×2 = 5,302 bytes.

Longer emails would be smaller in UTF-16. I don't know what the average works out to, but ~21 lines seems about average for an email, give or take.
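
The break-even point can be worked out with a rough model. It assumes every header character is ASCII (1 byte in UTF-8, 2 in UTF-16) and every body character is a CJK character in the Basic Multilingual Plane (3 bytes in UTF-8, 2 in UTF-16). Under those assumptions UTF-8 is smaller exactly when the CJK character count is below the ASCII character count. The function name `sizes` is made up for this sketch.

```python
def sizes(ascii_chars: int, cjk_chars: int) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) under the simplified model.

    Assumes ASCII is 1 byte in UTF-8 and BMP CJK is 3 bytes in UTF-8,
    while both are 2 bytes in UTF-16 (no BOM counted).
    """
    utf8 = ascii_chars + 3 * cjk_chars
    utf16 = 2 * (ascii_chars + cjk_chars)
    return utf8, utf16


print(sizes(1886, 765))   # (4181, 5302): the email above, UTF-8 is smaller
print(sizes(1886, 1886))  # (7544, 7544): break-even when the counts are equal
```

With 1,886 bytes of ASCII headers, the body would need more than 1,886 CJK characters before UTF-16 comes out ahead.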


knolax 1298 days ago

Your argument uses a website whose code was written by English speakers. There would still be ASCII, but verbose element names like "vector-page-toolbar-container" would definitely be shorter both in UTF-8 and in UTF-16 if they weren't written in English.


Beltalowda 1298 days ago

<div class=..> is still <div class=..>, no matter the language of the author. As is "background-color: #fff" in CSS. The UTF-16 version is almost double the size, so you'd have to replace a lot of identifiers with 2- or 3-byte ones.

Plus many identifiers come from libraries, and when creating their own identifiers many people use either full English or partial English no matter what language (it was a huge mistake to not use English for many identifiers in my first programming job, as you will invariably end up with a mishmash of two languages).

But it is easy enough to verify this with some actual websites: https://www.rakuten.co.jp is 330K in UTF-8 and 625K in UTF-16, https://ameblo.jp is 104K in UTF-8 and 187K in UTF-16, baidu.com is 360K in UTF-8 and 717K in UTF-16, sina.com.cn is 455K in UTF-8 and 854K in UTF-16, and daum.net is 666K in UTF-8 and 1.2M in UTF-16.

And all of that is only the HTML document; if we'd add up the CSS – where there's almost no possibility to use non-ASCII outside of class and ID names – and JavaScript – where the filesize is usually dominated by React or jQuery or whatnot – things would skew even more in favour of UTF-8.

I'm sure there are examples where a page served as UTF-16 is smaller, such as pages with very little markup (e.g. HN), but that is not the common case, even for websites written exclusively by and for users of CJK languages. Someone who does not speak a word of English will save many bytes of data every day with UTF-8. There's a reason all those websites are served as UTF-8 and not UTF-16.

But for the sake of the argument, let's replace all class="...", id="..", and data-event-name=".." with strings of the same length consisting of "回". That grows the filesize from 118K to 151K, and ... it's still larger in UTF-16 with 207K. We could start replacing more stuff and eventually UTF-16 may win, maybe. But you have to use a lot of CJK. Let's use a random excerpt:

        <button
            id="回回回回回回回回回回回回回回回回回回回回回回回回"
            tabindex="-1"
            data-event-name="回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回回"

This has 91 7-bit characters and 60 multibyte ones (this includes indentation, which may not be represented 100% accurately here). If we do the math this is:

  UTF-8    91×1 + 60×3 = 271 bytes
  UTF-16   91×2 + 60×2 = 302 bytes

UTF-8 still wins.
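
The same arithmetic can be checked by actually encoding a string. This is a minimal sketch: `encoded_sizes` is a made-up helper, and the sample only reproduces the character counts from the excerpt (91 ASCII characters, 60 times "回"), not its exact content. `utf-16-le` is used so that no byte order mark is added to the total.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Stand-in with the same counts as the excerpt: 91 ASCII + 60 CJK characters.
sample = "a" * 91 + "回" * 60
print(encoded_sizes(sample))  # (271, 302)
```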

And to repeat, there are certainly cases where UTF-16 is smaller. Markdown documents and other plain text files are obvious examples, but HTML is rarely one of them.

But imagine actually checking things before making a claim...


gary_0 1298 days ago

Also, HTTP tends to use gzip in-flight, and modern office document file formats also use compression, making any potential space savings for UTF-16 CJK text completely negligible in these common cases. UTF-8 and UTF-16 are basically the same size after compression[0].

[0] http://utf8everywhere.org/#asian
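
One way to try this yourself is to gzip the same text in both encodings and compare. This is a rough sketch using Python's standard `gzip` module. The `compare` helper and the sample string are made up for illustration, and the sample is far more repetitive than a real page, so it overstates how well either encoding compresses. Real measurements should use real documents.

```python
import gzip


def compare(text: str) -> dict[str, tuple[int, int]]:
    """Map each encoding name to (raw_bytes, gzipped_bytes)."""
    result = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        result[encoding] = (len(raw), len(gzip.compress(raw)))
    return result


# Synthetic sample: markup-heavy line with a little Japanese, repeated.
sample = '<div class="vector-page-toolbar"><nav aria-label="名前空間"></nav></div>\n' * 200

for encoding, (raw, packed) in compare(sample).items():
    print(f"{encoding}: {raw} bytes raw, {packed} bytes gzipped")
```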


knolax 1297 days ago

> But for the sake of the argument, let's replace all class="...", id="..", and data-event-name=".." with strings of the same length consisting of "回". That grows the filesize from 118K to 151

You're missing the point entirely: the number of characters you used is enough for 2 or 3 sentences. This was not an example constructed in good faith.


Beltalowda 1297 days ago

How is it "bad faith" when I spent the time actually converting and checking the entire document?
