| HN Mirror



	by stormbrew 4662 days ago
	Wish I'd known about this when I was pointing out in another HN thread how UTF-16 is a terrible encoding for, among other reasons, pushing the corner case where you find out your encoding/decoding is broken to the very edge of likelihood. It's ridiculous that V8 doesn't properly support UTF-16, but it's to be expected I suppose. UTF-8 does not have this problem. That's the way we should be moving.

3 comments

ender7 4662 days ago

This behavior is actually part of the ECMAScript standard [0], so it's unlikely that V8 (or any other conformant JS engine) would behave the way you (and many others) would want.

JS's treatment of strings is even more wacky than you might think -- it is neither really UCS-2 nor UTF-16. Engines are semi-required to use UTF-16 representations of strings internally, but the API surface that is exposed to the JS code makes them look like UCS-2 strings (i.e. no surrogate pairs). However, if you stick a JS string into something that is UTF-16 aware, such as a DOM node, then the surrogate pairs will display correctly.

See [1] for a very clear explanation of this muddy subject.

[0] http://www.ecma-international.org/ecma-262/5.1/#sec-8.4

[1] http://mathiasbynens.be/notes/javascript-encoding
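
To make the code unit vs. surrogate pair distinction concrete, here is a minimal Python sketch (not JS, and not taken from either link). The helper name utf16_code_units is made up for illustration. It lists the UTF-16 code units that a UCS-2-style API would index into.

```python
import struct

def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units that encode `text` (hypothetical helper)."""
    raw = text.encode("utf-16-le")  # little-endian, no BOM
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

s = "a\U0001F4A9"  # 'a' followed by U+1F4A9, which lies outside the BMP

print(len(s))                                  # 2 (Python counts code points)
print([hex(u) for u in utf16_code_units(s)])   # ['0x61', '0xd83d', '0xdca9']
```

The one astral character becomes two code units (a surrogate pair), which is why an API that counts code units reports a length of 3 for this string.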

stormbrew 4662 days ago

That is all incredibly depressing.

sillysaurus2 4662 days ago

This. Why doesn't everybody use UTF-8? Nobody seems to have any problems with UTF-8. It seems to work almost perfectly, and it's efficient.

est 4662 days ago

Because some of us are pissed that some BMP characters take 3 bytes in UTF-8. That's 50% more wasted storage space and 50% more time to read/write.

I like the design of Python 3.3 encoding. ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes.

http://www.python.org/dev/peps/pep-0393/
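
A rough way to observe the 1/2/4-byte behaviour, assuming CPython 3.3 or later (other implementations may store strings differently). The helper bytes_per_char is a made-up name. It compares two strings of the same kind so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage for strings made only of `ch`.

    CPython-specific: relies on sys.getsizeof reflecting the internal layout.
    """
    short = ch * 2          # freshly built strings of the same internal kind
    long = ch * (n + 2)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n

for label, ch in [("ASCII", "a"), ("BMP", "\u4e2d"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

On CPython this should report 1, 2 and 4. Note that the width is chosen per string, so a single astral character makes the whole string use 4 bytes per character.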

deathanatos 4662 days ago

The good point (in my opinion) is not that "ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes", but rather that the exposed API hides this from you, and exposes to you a sequence of code points. This, I hope, will reduce errors, as code points, not code units, are often a better abstraction to be working with. (For some random string processing function.)

So far as I know, Haskell is the only other language that exposes, as the defaultish-native interface, Unicode strings as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s templates are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.

See: http://www.unicode.org/glossary/#code_point http://www.unicode.org/glossary/#code_unit

millstone 4662 days ago

Code points are a better abstraction than code units, but they're still a piss-poor abstraction.

Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).

An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
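
A small Python sketch of the substring problem described above. The function naive_clusters is a made-up name and a deliberate simplification: it only attaches combining marks to the preceding code point and is not the TR29 algorithm (no zero width joiners, no Hangul, and so on).

```python
import unicodedata

def naive_clusters(text: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    Simplified illustration only; real grapheme segmentation follows UAX #29.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters

s = "cafe\u0301"                  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(s))                     # 5 code points for 4 user-perceived characters
print(s[:4] == "cafe")            # True: slicing by code point drops the accent
print(len(naive_clusters(s)))     # 4
```

Even with code-point indexing, s[:4] silently produces a different word, which is the point being made about code points not being enough.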

pyre 4662 days ago

This was my reaction too. It's Unicode all the way down... :)

cmccabe 4661 days ago

Go allows you to iterate over a string as a series of code points.

stormbrew 4662 days ago

That's a reasonable replacement for ucs-4 for an internal representation, but it's not actually a character encoding like utf-8 and utf-16 are. It's just a tagged union of several encodings.

As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8 and 2 bytes in utf-16. It tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
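
For anyone who wants to measure rather than argue, a minimal Python sketch. The sample strings and the helper name sizes are made up for illustration, and the samples are far too short to say anything meaningful about compression, so run it on a real corpus.

```python
import zlib

def sizes(text: str) -> dict[str, int]:
    """Byte counts for `text` in UTF-8 and UTF-16, raw and zlib-compressed."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # no BOM, so only the payload is counted
    return {
        "utf-8": len(utf8),
        "utf-16": len(utf16),
        "utf-8 + zlib": len(zlib.compress(utf8)),
        "utf-16 + zlib": len(zlib.compress(utf16)),
    }

samples = {
    "latin": "hello world",
    "cjk": "你好世界",
    "markup": '<p class="x">你好</p>',
}

for name, text in samples.items():
    print(name, sizes(text))
```

The raw counts show both directions: the Latin sample doubles in UTF-16, the pure CJK sample is 50% larger in UTF-8, and the mixed markup sample leans towards UTF-8.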

est 4662 days ago

> 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8

For latin alphabets, yes. For CJK, it's really bad. Things get worse if you dealt with non-BMP before, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (why the hell do people even presume utf8 is max 3 bytes?)
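
A quick Python check of how many bytes UTF-8 needs at the top of each range (utf8_length is a made-up helper name). Anything above U+FFFF, emoji included, needs 4 bytes, which is more than a 3-byte-per-character column can hold.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# Highest code point of each UTF-8 length class.
for cp in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")

print(utf8_length(0x1F600))  # 4 (an emoji outside the BMP)
```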

rspeer 4662 days ago

Because people either don't know anything outside of the BMP exists, or they think astral characters are only for dead languages (they haven't had the dawning realization about emoji yet), or they use a programming language like Java that accidentally implemented CESU-8 and called it "UTF8" a decade and a half ago and isn't allowed to fix it.

One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.

Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.
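
For reference, a minimal Python sketch of what CESU-8 does to an astral character: it takes the UTF-16 surrogate pair and writes each half as its own 3-byte sequence. The function cesu8_encode is a made-up name, not a standard library codec.

```python
import struct

def cesu8_encode(text: str) -> bytes:
    """Encode `text` as CESU-8: UTF-16 code units, each written UTF-8 style."""
    raw = text.encode("utf-16-be")
    units = struct.unpack(f">{len(raw) // 2}H", raw)
    # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

emoji = "\U0001F600"
print(emoji.encode("utf-8").hex(" "))   # f0 9f 98 80
print(cesu8_encode(emoji).hex(" "))     # ed a0 bd ed b8 80

try:
    cesu8_encode(emoji).decode("utf-8")  # a strict UTF-8 decoder rejects it
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

For BMP-only text the two encodings produce identical bytes, which is why the difference goes unnoticed until emoji show up.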

kps 4662 days ago

    > ... MySQL ...
    > why the hell do people even presume utf8 is max 3 bytes?

I think you answered your own question before you even asked.

rspeer 4662 days ago

Except most text isn't plain text. HTML pages in CJK are still smaller in UTF-8 than in their respective countries' favorite encodings.

thaumasiotes 4662 days ago

Well... China's favorite encoding is GB, which encodes ASCII values as one byte and Chinese characters as two bytes. It's hard to see how UTF-8 (one byte for ASCII, three for characters) would beat that, on the assumption that nearly 100% of what a Chinese website would want to transmit is either ASCII (where UTF-8 is equivalent to GB) or Chinese (where it's inferior).

How do I know GB is preferred? I'm going off of three things:

- According to Wikipedia (http://en.wikipedia.org/wiki/GB18030), software sold in China is legally required to support it.

- I was once given a Chinese ebook, which I had to figure out was in GB before I could read it. (And now, I know about chardet!)

- I worked with a Chinese programmer who accidentally committed files in GB, even though they were supposed to be in UTF-8.

And since the latest GB can in fact represent any Unicode code point, it's hard to see why it wouldn't be preferred indefinitely.
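
The byte counts are easy to check with Python's built-in gb18030 codec. A minimal sketch, with made-up sample strings and a made-up helper name encoded_sizes:

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Length in bytes of `text` under UTF-8 and GB18030."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "gb18030")}

samples = {
    "ascii": "hello",
    "chinese": "中文",
    "markup": '<p class="x">你好</p>',
    "emoji": "\U0001F600",
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

ASCII comes out the same in both, common Chinese characters take 2 bytes in GB18030 against 3 in UTF-8, and a code point outside the BMP takes 4 bytes in either encoding.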

lifthrasiir 4662 days ago

You are tremendously wrong. Almost every legacy CJK encoding encodes a string in a smaller number of bytes than UTF-8 when the string in question has no unsupported characters in it. I have seen lots of (mostly misguided) people who prefer those legacy encodings over UTF-8/16 solely for this reason.

est 4662 days ago

> still smaller in UTF-8 than in their respective countries' favorite encodings

How so? AFAIK Shift-JIS is ASCII compatible just like UTF-8, and so are other double-byte encodings like BIG5 and GBK.

sillysaurus2 4662 days ago

Storage space is cheap, and the price continues to fall. Storage of text is virtually nothing. Bandwidth to send text is almost nothing. Also, most text is compressed, which virtually eliminates that concern.

Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.

erichurkman 4662 days ago

> Storage space is cheap, and the price continues to fall.

At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.

est 4662 days ago

> Storage space is cheap

True, but

1. time is precious. For example, you waste 50% more time for a fulltext indexing scan because utf8 is longer.

2. Memory. If you can't hold text in a single machine, you have bigger issues (e.g. clustering algorithms, persistency, redundancy, etc.)

3. Network transfer. If you can save 50% in a db connection rtt, you save a lot.

It makes no sense to save BMP in 3 bytes anyway.

acdha 4662 days ago

You'd have to have rather weird data for it to be anywhere near 50% larger for real text (i.e. even if you only use Chinese, if you have punctuation, arabic numerals, quotes or URLs, HTML, etc. the averages cancel more than you might think) and a completely incompetent search engine design for that to remotely approach 50% more time to query or index.

If you were assigned the task of indexing the UTF-8 worst-case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than dealing with external interfaces which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).

est 4662 days ago

> nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages ...

That's exactly how those UTF-X and UCS-Y encodings were invented, right?

The point is, this beast is called unicode, how ironic.

pcwalton 4662 days ago

1. Controversy over Han unification made Unicode adoption less universal than might have been hoped.

2. Interoperability with legacy systems that don't use UTF-8 (for example, JavaScript). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.

millstone 4662 days ago

Did you read the article? The problem occurs precisely because V8 mishandles UTF-8.

Also check out the bug report: https://code.google.com/p/v8/issues/detail?id=2875

ximeng 4662 days ago

A lot of Windows is UTF-16 or UCS-2, including Office, which forces their use for working with APIs or transferring data.

millstone 4662 days ago

Why do you think that UTF-16's corner cases, by which you presumably mean surrogate pairs, are less likely than UTF-8's corner cases, like invalid code units and non-shortest forms?

I would argue that the UTF-8 corner cases are more rare because they are harder to produce accidentally, and also more serious because they have security implications.
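
To illustrate the kind of corner cases meant here, a minimal Python sketch that feeds a few malformed byte sequences to a strict UTF-8 decoder. The helper name is_valid_utf8 and the sample set are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "plain ASCII '/'": b"/",
    "non-shortest form of '/'": b"\xc0\xaf",
    "stray continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe4\xb8",
    "encoded surrogate": b"\xed\xa0\x80",
}

for label, data in samples.items():
    print(f"{label}: {is_valid_utf8(data)}")
```

Only the first sample is accepted. The non-shortest form is the security-relevant one: a lenient decoder that maps it to '/' can let a path separator slip past a check that ran on the raw bytes.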