| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by est 4667 days ago

Because some of us are pissed that some BMP characters takes 3 bytes in UTF8, that's 50% more waste of storage space and 50% more time to read/write.

I like the design of Python 3.3 encoding. ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes.

http://www.python.org/dev/peps/pep-0393/

3 comments

deathanatos 4667 days ago

The good point (in my opinion) is not that "ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes", but rather that the exposed API hides this from you, and exposes to you a sequence of code points. This, I hope, will reduce errors, as code points, not code units, is often a better abstraction to be working with. (For some random string processing function.)

So far as I know, Haskell is the only other language that I know of that exposes, as the defaultish-native interface, Unicode strings as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s template are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.

See: http://www.unicode.org/glossary/#code_point http://www.unicode.org/glossary/#code_unit

millstone 4667 days ago

Code points is a better abstraction than code units, but it's still a piss-poor abstraction.

Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).

An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.

pyre 4667 days ago

This was my reaction too. It's Unicode all the way down... :)

cmccabe 4666 days ago

Go allows you to iterate over a string as a series of code points.

stormbrew 4667 days ago

That's a reasonable replacement for ucs-4 for an internal representation, but it's not actually a character encoding like utf-8 and utf-16 are. It's just a tagged union of several encodings.

As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8 and 2 bytes in utf-16. It tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.

est 4667 days ago

> 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8

For latin alphabets, yes. For CJK, it's really bad. Things get worse if you dealt with non-BMP before, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (why the hell do people even presume utf8 is max 3 bytes?)

rspeer 4667 days ago

Because people either don't know anything outside of the BMP exists, or they think astral characters are only for dead languages (they haven't had the dawning realization about emoji yet), or they use a programming language like Java that accidentally implemented CESU-8 and called it "UTF8" a decade and a half ago and isn't allowed to fix it.

One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.

Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.

kps 4667 days ago

    > ... MySQL ...
    > why the hell do people even presume utf8 is max 3 bytes?

I think you answered your own question before you even asked.

rspeer 4667 days ago

Except most text isn't plain text. HTML pages in CJK are still smaller in UTF-8 than in their respective countries' favorite encodings.

thaumasiotes 4667 days ago

well.... China's favorite encoding is GB, which encodes ascii values as one byte and chinese characters as two bytes. It's hard to see how UTF-8 (one byte for ascii, three for characters) would beat that, on the assumption that nearly 100% of what a chinese website would want to transmit is either ascii (where UTF-8 is equivalent to GB) or chinese (where it's inferior).

How do I know GB is preferred? I'm going off of three things:

- According to wikipedia (http://en.wikipedia.org/wiki/GB18030), software sold in China is legally required to support it.

- I was once given a chinese ebook, which I had to figure out was in GB before I could read it. (And now, I know about chardet!)

- I worked with a chinese programmer who accidentally committed files in GB, even though they were supposed to be in UTF-8.

And since the latest GB can in fact represent any unicode point, it's hard to see why it wouldn't be preferred indefinitely.

rspeer 4663 days ago

Okay, you are right. The comparison I was thinking of was actually about UTF-16, and of course that's not actually preferred to GB or Shift-JIS or whatever.

lifthrasiir 4667 days ago

You are tremendously wrong. Almost every legacy CJK encoding encodes a string in the smaller number of bytes than UTF-8 when the string in question has no unsupported characters in it. I have seen lots of (mostly misguided) people who prefer those legacy encodings over UTF-8/16 solely for this reason.

est 4667 days ago

> still smaller in UTF-8 than in their respective countries' favorite encodings

How so? AFAIK Shift-JIS is ASCII compatible just like UTF8, so does other double byte encodings like BIG5 and GBK.

sillysaurus2 4667 days ago

Storage space is cheap, and the price continues to fall. Storage of text is virtually nothing. Bandwidth to send text is almost nothing. Also, most text is compressed, which virtually eliminates that concern.

Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.

erichurkman 4667 days ago

> Storage space is cheap, and the price continues to fall.

At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.

est 4667 days ago

> Storage space is cheap

True, but

1. time is precious. For example, you waste 50% more time for a fulltext indexing scan because utf8 is longer.

2. Memory. If you can't hold text in a single machine, you have bigger issues (e.g. clustering algorithms, persistency, redundancy, etc.)

3. Network transfer. If you can save 50% in a db connection rtt, you save a lot.

It makes no sense to save BMP in 3 bytes anyway.

acdha 4667 days ago

You'd have to have rather weird data for it to be anywhere near 50% larger for real text (i.e. even if you only use Chinese, if you have punctuation, arabic numerals, quotes or URLs, HTML, etc. the averages cancel more than you might think) and a completely incompetent search engine design for that to remotely approach 50% more time to query or index.

If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Y internal details like compression are much easier to change than dealing with external interfaces which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup)

est 4667 days ago

> nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages ...

That's exactly how those UTF-X and UCS-Y encodings were invented, right?

The point is, this beast is called unicode, how ironic.

acdha 4667 days ago

The difference is internal vs. external: if you use UTF-8 things simply work better any time you exchange data. If you had lots of high code points (>\U+8000) you would get better results in most cases using a full compression system rather than playing with 2-byte encoding hacks, which is really all you're talking about in both cases.

est 4667 days ago

I totally agree that UTF-8 is pretty good at exchanging data because UTF8 is better than UCS-* and other UTF-* overall, and because everyone is (and should be) using it.

But you know, there are other cases besides exchanging. Like I said, if your text data is mainly latin you are good, but not so good if you are stuck with non-latin BMPs.