The good point (in my opinion) is not that "ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes", but rather that the exposed API hides this from you and exposes a sequence of code points. This, I hope, will reduce errors, as code points, not code units, are often the better abstraction to work with in an arbitrary string-processing function.
As far as I know, Haskell is the only other language that exposes Unicode strings, as its default-ish native interface, as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s templates are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.
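To make the code point versus code unit distinction concrete, here is a small Python 3 sketch. The sample string is an arbitrary choice of mine, not something from the thread; it just mixes one ASCII letter, one BMP ideograph, and one astral emoji.

```python
# Python 3.3+ strings index by code point, regardless of internal storage.
s = "a\u4e2d\U0001F600"  # ASCII letter, BMP ideograph, astral emoji

code_points = len(s)                           # what the str API exposes
utf16_units = len(s.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit
utf8_units = len(s.encode("utf-8"))            # 1 byte per UTF-8 code unit

print(code_points, utf16_units, utf8_units)
```

A language that exposes UTF-16 code units would report the middle number as the string's length, and indexing could land between the two halves of the emoji's surrogate pair.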
Code points are a better abstraction than code units, but they're still a piss-poor abstraction.
Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true that working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).
An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
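A rough illustration of the combining-mark problem, as a Python sketch. The helper name `safe_prefix` is hypothetical, and it only checks the canonical combining class via `unicodedata.combining`; it does not implement the TR29 grapheme cluster rules (no zero width joiner or Hangul handling), so treat it as a demonstration of the failure, not a fix.

```python
import unicodedata

s = "cafe\u0301s"  # "cafés" with é written as "e" + COMBINING ACUTE ACCENT

# Slicing by code point cuts between "e" and its accent.
naive = s[:4]

def safe_prefix(text, n):
    """Cut after n code points, then extend the cut past any combining marks.

    Assumption: only characters with a nonzero canonical combining class
    are treated as attached to the preceding character.
    """
    end = min(n, len(text))
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return text[:end]

print(repr(naive), repr(safe_prefix(s, 4)))
```

The naive slice silently turns "café" into "cafe", even though no surrogate pair was involved.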
That's a reasonable replacement for ucs-4 for an internal representation, but it's not actually a character encoding like utf-8 and utf-16 are. It's just a tagged union of several encodings.
As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large numbers of characters that fit in one byte in utf-8 and 2 bytes in utf-16, so it tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
> 50% is just the absolute worst case. Many kinds of textual data include large numbers of characters that fit in one byte in utf-8
For Latin alphabets, yes. For CJK, it's really bad. Things get worse once you have to deal with non-BMP characters, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (Why the hell do people even presume utf8 is max 3 bytes?)
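For reference, the byte counts behind both complaints can be checked with a few lines of Python. The sample characters are my own picks for illustration.

```python
cjk = "\u4e2d\u6587"    # two BMP ideographs
emoji = "\U0001F600"    # one astral (non-BMP) code point

# BMP ideographs: 3 bytes each in UTF-8 versus 2 in UTF-16, hence the 50% figure.
cjk_utf8 = len(cjk.encode("utf-8"))
cjk_utf16 = len(cjk.encode("utf-16-le"))

# Astral code points need 4 bytes in UTF-8, so a 3-byte limit cannot store them.
emoji_utf8 = emoji.encode("utf-8")

# U+FFFF is the highest code point that still fits in 3 bytes.
last_three_byte = "\uffff".encode("utf-8")

print(cjk_utf8, cjk_utf16, len(emoji_utf8), len(last_three_byte))
```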
Because people either don't know anything outside of the BMP exists, or they think astral characters are only for dead languages (they haven't had the dawning realization about emoji yet), or they use a programming language like Java that accidentally implemented CESU-8 and called it "UTF8" a decade and a half ago and isn't allowed to fix it.
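For anyone who hasn't seen it, this is roughly what a CESU-8 style encoder does to an astral character, sketched in Python. The function name `cesu8_encode` is hypothetical, and this is an illustration of the encoding scheme in general, not a reproduction of any particular implementation.

```python
def cesu8_encode(text):
    """Encode text CESU-8 style: astral code points become two 3-byte surrogates."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair first.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            # "surrogatepass" lets Python emit each surrogate as a 3-byte sequence.
            out += (chr(high) + chr(low)).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)

emoji = "\U0001F600"
real_utf8 = emoji.encode("utf-8")   # 4 bytes
cesu8 = cesu8_encode(emoji)         # 6 bytes

try:
    cesu8.decode("utf-8")           # a strict UTF-8 decoder rejects this
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False

print(real_utf8, cesu8, strict_decoder_accepts)
```

The two outputs agree for everything in the BMP, which is why the difference goes unnoticed until astral characters show up.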
Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.
Well, China's favorite encoding is GB, which encodes ASCII values as one byte and Chinese characters as two bytes. It's hard to see how UTF-8 (one byte for ASCII, three for Chinese characters) would beat that, on the assumption that nearly 100% of what a Chinese website would want to transmit is either ASCII (where UTF-8 is equivalent to GB) or Chinese (where it's inferior).
How do I know GB is preferred? I'm going off of three things:
Okay, you are right. The comparison I was thinking of was actually about UTF-16, and of course that's not actually preferred to GB or Shift-JIS or whatever.
You are tremendously wrong. Almost every legacy CJK encoding encodes a string in a smaller number of bytes than UTF-8 does when the string in question has no unsupported characters in it. I have seen lots of (mostly misguided) people who prefer those legacy encodings over UTF-8/16 solely for this reason.
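The size difference is easy to check with Python's built-in codecs. The sample strings below are my own, and I'm using the `gbk` and `shift_jis` codecs as stand-ins for "GB" and "Shift-JIS".

```python
chinese = "\u4e2d\u6587\u7f16\u7801"   # 4 simplified Chinese characters
japanese = "\u65e5\u672c\u8a9e"        # 3 kanji

sizes = {
    "chinese gbk": len(chinese.encode("gbk")),
    "chinese utf-8": len(chinese.encode("utf-8")),
    "japanese shift_jis": len(japanese.encode("shift_jis")),
    "japanese utf-8": len(japanese.encode("utf-8")),
    # ASCII costs one byte per character in all of these encodings.
    "ascii gbk": len("hello".encode("gbk")),
    "ascii utf-8": len("hello".encode("utf-8")),
}

for name, size in sizes.items():
    print(name, size)
```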
Storage space is cheap, and the price continues to fall. Storage of text is virtually nothing. Bandwidth to send text is almost nothing. Also, most text is compressed, which virtually eliminates that concern.
Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.
> Storage space is cheap, and the price continues to fall.
At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.
You'd have to have rather weird data for it to be anywhere near 50% larger for real text: even if you only use Chinese, punctuation, Arabic numerals, quotes, URLs, HTML, etc. make the averages cancel more than you might think. And you'd need a completely incompetent search engine design for that to remotely approach 50% more time to query or index.
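A quick way to see the averaging effect is to measure a made-up snippet of markup, sketched here in Python. The sample HTML and the repetition count are invented for illustration, so the exact numbers say nothing about real corpora.

```python
import gzip

pure_cjk = "\u4e2d\u6587" * 10
mixed = '<p><a href="http://example.com/">' + pure_cjk + "</a></p>"

def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

pure_utf8, pure_utf16 = sizes(pure_cjk)   # the worst case: UTF-8 is 50% larger
mixed_utf8, mixed_utf16 = sizes(mixed)    # ASCII markup pulls UTF-8 back ahead

# Compress a longer, repetitive page in both encodings.
page = mixed * 50
gz_utf8 = len(gzip.compress(page.encode("utf-8")))
gz_utf16 = len(gzip.compress(page.encode("utf-16-le")))

print(pure_utf8, pure_utf16, mixed_utf8, mixed_utf16, gz_utf8, gz_utf16)
```

With this particular sample the mixed text is actually smaller in UTF-8, and after gzip both encodings shrink to a small fraction of their raw size.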
If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).
The difference is internal vs. external: if you use UTF-8 things simply work better any time you exchange data. If you had lots of high code points (>U+8000) you would get better results in most cases using a full compression system rather than playing with 2-byte encoding hacks, which is really all you're talking about in both cases.
1. Controversy over Han unification made Unicode adoption less universal than might have been hoped.
2. Interoperability with legacy systems that don't use UTF-8 (JavaScript, for one). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.
I like the design of Python 3.3 encoding. ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes.
http://www.python.org/dev/peps/pep-0393/
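You can observe the per-character width from Python itself. This is a sketch that assumes CPython 3.3 or later; `sys.getsizeof` reports implementation-specific sizes, and the helper `bytes_per_char` is a hypothetical name of mine. Comparing two lengths of the same string cancels out the fixed object header.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by growing a string by n characters.

    Assumption: CPython 3.3+ with the PEP 393 flexible representation.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

ascii_width = bytes_per_char("a")
bmp_width = bytes_per_char("\u4e2d")
astral_width = bytes_per_char("\U0001F600")

print(ascii_width, bmp_width, astral_width)
```

Note that the width is chosen per string from its widest character, so a single astral character makes every character in that string take 4 bytes.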