Wish I'd known about this when I was pointing out in another HN thread how utf-16 is a terrible encoding for, among other reasons, pushing the corner case where you find out your encoding/decoding is broken to the very edge of likelihood. It's ridiculous that v8 doesn't properly support utf16, but it's to be expected I suppose.
UTF-8 does not have this problem. That's the way we should be moving.
This behavior is actually part of the ECMAScript standard [0], so it's unlikely that V8 (or any other conformant JS engine) would behave the way you (and many others) would want.
JS's treatment of strings is even more wacky than you might think -- it is neither really UCS-2 nor UTF-16. Engines are semi-required to use UTF-16 representations of strings internally, but the API surface that is exposed to the JS code makes them look like UCS-2 strings (i.e. no surrogate pairs). However, if you stick a JS string into something that is UTF-16 aware, such as a DOM node, then the surrogate pairs will display correctly.
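To make the code unit vs. code point distinction concrete, here is a rough sketch in Python rather than JS. The helper name utf16_code_units is made up for illustration. It counts the 16-bit code units a string needs in UTF-16, which is the unit a JS string's length is measured in, next to the code point count that Python's len() reports.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" prepends.
    return len(s.encode("utf-16-le")) // 2


bmp = "\u00e9"         # U+00E9, inside the BMP
astral = "\U0001F4A9"  # U+1F4A9, outside the BMP

# Each line prints: code points, then UTF-16 code units.
print(len(bmp), utf16_code_units(bmp))
print(len(astral), utf16_code_units(astral))
```

The astral character is one code point but two code units, so any API that indexes by code unit can land in the middle of it.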
See [1] for a very clear explanation of this muddy subject.
The good point (in my opinion) is not that "ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes", but rather that the exposed API hides this from you, and exposes to you a sequence of code points. This, I hope, will reduce errors, as code points, not code units, are often a better abstraction to be working with. (For some random string processing function.)
So far as I know, Haskell is the only other language that exposes, as the defaultish-native interface, Unicode strings as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s templates are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.
Code points are a better abstraction than code units, but they're still a piss-poor abstraction.
Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).
An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
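A small Python sketch of the combining mark problem, using only the standard library. Python indexes strings by code point, so it never splits a surrogate pair, but it will happily cut a base letter away from its accent. As far as I know the standard library has no grapheme cluster segmentation, so this only shows the failure, not a fix.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Slicing by code point cuts between the base letter and its mark.
head = decomposed[:1]

print(len(decomposed))  # code points in what renders as one character
print(head == "e")      # True means the accent was lost by the slice
# A non-zero combining class marks the second code point as a combining mark.
print(unicodedata.combining(decomposed[1]) != 0)
```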
That's a reasonable replacement for ucs-4 for an internal representation, but it's not actually a character encoding like utf-8 and utf-16 are. It's just a tagged union of several encodings.
As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large amounts of characters that fit in one byte in utf-8 and 2 bytes in utf-16. It tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
> 50% is just the absolute worst case. Many kinds of textual data include large amounts of characters that fit in one byte in utf-8
For Latin alphabets, yes. For CJK, it's really bad. Things get worse once you have to deal with non-BMP characters, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is total bullshit. (Why the hell do people even presume utf8 is max 3 bytes?)
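For anyone who wants numbers instead of arguing, a quick Python sketch that measures a few sample strings in both encodings. The sample strings and the sizes helper are made up for illustration; real corpora will mix these cases in different proportions.

```python
samples = {
    "latin": "The quick brown fox",
    "cjk": "\u4f60\u597d\u4e16\u754c",              # four BMP CJK characters
    "emoji": "\U0001F600",                          # one non-BMP character
    "cjk_in_html": '<p class="x">\u4f60\u597d</p>',  # CJK wrapped in ASCII markup
}


def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    u8, u16 = sizes(text)
    print(f"{name}: utf-8={u8} bytes, utf-16={u16} bytes")
```

Pure CJK hits the 50% worst case (3 bytes against 2 per character), a non-BMP character costs 4 bytes either way, and a little ASCII markup is enough to tip the balance back toward UTF-8.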
Because people either don't know anything outside of the BMP exists, or they think astral characters are only for dead languages (they haven't had the dawning realization about emoji yet), or they use a programming language like Java that accidentally implemented CESU-8 and called it "UTF8" a decade and a half ago and isn't allowed to fix it.
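For reference, this is roughly what the CESU-8 mistake looks like, sketched in Python. The function name cesu8_encode is hypothetical; the idea is to go through UTF-16 code units first and then encode each unit, surrogates included, as if it were a code point. Java's "modified UTF-8" additionally encodes U+0000 as two bytes, which this sketch does not do.

```python
def cesu8_encode(s: str) -> bytes:
    """Encode s the CESU-8 way: UTF-16 code units first, then each unit as UTF-8."""
    raw = s.encode("utf-16-le")
    out = bytearray()
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "little")
        # "surrogatepass" lets Python emit the 3-byte form for a surrogate,
        # which the strict UTF-8 encoder refuses to do.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"
print(emoji.encode("utf-8").hex())  # real UTF-8: one 4-byte sequence
print(cesu8_encode(emoji).hex())    # CESU-8: two 3-byte sequences
```

A strict UTF-8 decoder rejects the second form, which is exactly how these things blow up when the data crosses into a system that checks.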
Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.
Storage space is cheap, and the price continues to fall. Storage of text is virtually nothing. Bandwidth to send text is almost nothing. Also, most text is compressed, which virtually eliminates that concern.
Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.
> Storage space is cheap, and the price continues to fall.
At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.
You'd have to have rather weird data for it to be anywhere near 50% larger for real text (i.e. even if you only use Chinese, if you have punctuation, arabic numerals, quotes or URLs, HTML, etc. the averages cancel more than you might think) and a completely incompetent search engine design for that to remotely approach 50% more time to query or index.
If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated. (This is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup.)
1. Controversy over Han unification made Unicode adoption less universal than might have been hoped.
2. Interoperability with legacy systems that don't use UTF-8 (JavaScript, for one). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.
Why do you think that UTF-16's corner cases, by which you presumably mean surrogate pairs, are less likely than UTF-8's corner cases, like invalid code units and non-shortest forms?
I would argue that the UTF-8 corner cases are more rare because they are harder to produce accidentally, and also more serious because they have security implications.
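To illustrate the non-shortest form case, a minimal Python sketch. The helper name is_strict_utf8 is made up. The bytes C0 AF are an overlong two-byte spelling of "/" (U+002F). A lenient decoder that accepts them can let a slash slip past a filter that only looks for the byte 0x2F, which is where the security angle comes from. Python's built-in decoder rejects them.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8 under Python's default strict rules."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


print(is_strict_utf8(b"/"))         # shortest form of U+002F
print(is_strict_utf8(b"\xc0\xaf"))  # overlong 2-byte form of the same code point
print(is_strict_utf8(b"\xff"))      # 0xFF never appears in well-formed UTF-8
```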
Man, I'm starting to think there is a cult around JSON.
If you need to accept arbitrary binary data, JSON is a profoundly bad choice. At a minimum, you would expect them to base64 encode the data and put that into a JSON string.
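The base64 approach is only a few lines. A sketch in Python, where the field name payload_b64 and the sample bytes are invented for the example:

```python
import base64
import json

raw = bytes([0xD8, 0x3D, 0xFF, 0x00, 0x41])  # arbitrary bytes, not valid UTF-8

# "payload_b64" is a made-up field name; base64 output is plain ASCII,
# so it is always safe inside a JSON string.
doc = json.dumps({"payload_b64": base64.b64encode(raw).decode("ascii")})

restored = base64.b64decode(json.loads(doc)["payload_b64"])
print(restored == raw)
```

The cost is roughly a third more bytes on the wire, but the receiver gets back exactly what was sent.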
If you are looking at error reports, how is it even remotely acceptable to have them silently modified to include invalid unicode replacement characters?
The lesson here isn't some crappy hack workaround they found, it's a case study in the lengths you'll have to go to when you insist on making technology choices without considering the problem you want to solve.
Any wire-serialization format that wants to send arbitrary data should really have a "raw binary payload" type. XML has CDATA. ASN.1 has bitstrings. BERT has Binaries. But JSON doesn't really have anything like that.
I wonder... at some point, Javascript could get a convenient literal syntax for creating pre-filled ArrayBuffers, which would basically be the format JSON would want to adopt. But would it? Are changes to Javascript literal syntax folded into JSON, or is JSON now its own thing that doesn't track JS any more?
Isn't CDATA character data anyway and thus not even binary but in the document's character set? Which makes it a poor choice for binary data even without taking into account that XML forbids certain characters.
As for binary data in web services ... isn't it easier to just use Content-Type for that and use the appropriate type for the payload? That wouldn't require a textual data format that can contain arbitrary binary data.
That presumes you want to send a lot of binary data. Sometimes you just have five bytes or so. (E.g., as in the article, an un-decoded string.)
(Still, you're right, I admit to having been able to avoid any work painful enough to teach me XML arcana. I was actually thinking of one of the many variants of "Binary XML" I had read about recently, and assumed the typing was bijective to XML's own types. In other news, BSON of all things has a raw-binary type.)
String encoding in general is a mess. Wait till you get to code pages. Incidentally, the largest JS script I've ever seen pertained to encoding and decoding characters under various codepages: https://raw.github.com/Niggler/js-codepage/master/cptable.js [github complains "(Sorry about that, but we can't show files that are this big right now.)"]
The OP describes an environment where data goes from node to Rails.
If you want to check a string for valid encoding and/or replace bad bytes with replacement char on the _ruby_ end... it's not very obvious how you do that with the ruby stdlib api, and it takes a few tricks to do right.
This is true of JSON, but it's not true of Javascript which gives no fucks about utf16 (or valid surrogate pairs). It's a very strange world where JSON and Javascript have incompatible interpretations of strings.
They wanted to parse some bytes as utf-16, but are unable to do so because V8 only understands ucs2 (with invalid surrogate pairs). This is a major problem with node, i.e., it happily produces/consumes invalid unicode encoded strings.
how did the error JSON include the undecodable bytes? JSON strings are all unicode sequences, so there would have had to be some way that the raw bytes were mapped into codepoints.
on the other hand, if the offending bytes were blindly substituted into the JSON, then it's not surprising that there were decoding issues down the line...
> The exceptions that were crashing us were caused by people using String.prototype.substr. That function works perfectly on strings that only contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2 string there's a possibility that when you take a slice you'll split a valid surrogate pair into two invalid lonely surrogates.
To me, it seems like it'd be nearly impossible for somebody to trigger, but there's always Murphy's law...
These kinds of isolated surrogates are pretty easy to create if you're doing the right kind of processing on the right kind of data.
Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS String, then start processing it in fixed size chunks. If your source text contains any significant percentage of surrogate pair-represented characters, you'll eventually break one.
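Here is that failure mode simulated in Python, since Python strings don't expose code units directly. Both helper names are hypothetical. The text is laid out as UTF-16 code units, cut into fixed-size chunks the way naive slicing on a JS string would, and each chunk is then checked for well-formedness.

```python
def chunk_code_units(s, size):
    """Split s into chunks of `size` UTF-16 code units, ignoring pair boundaries."""
    raw = s.encode("utf-16-le")
    step = size * 2  # two bytes per code unit
    return [raw[i:i + step] for i in range(0, len(raw), step)]


def is_well_formed(chunk):
    """True if the chunk is valid UTF-16 on its own (no lone surrogates)."""
    try:
        chunk.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


text = "ab\U0001F600cd"  # code units: a, b, high surrogate, low surrogate, c, d

print([is_well_formed(c) for c in chunk_code_units(text, 2)])  # pair stays together
print([is_well_formed(c) for c in chunk_code_units(text, 3)])  # pair is split
```

Whether a given chunk size breaks a pair depends only on where the pair happens to fall, which is why this tends to show up in production rather than in tests.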
In the example I looked at to debug this, the sequence of events was:
1. One of our customer's javascript apps sent a truncated string to their web-server in a JSON payload. This string ended with a leading surrogate (this is another instance of the V8 bug discussed in the blog post).
2. Their ruby backend exploded when they tried to use a regular expression on the string (because ruby's regexp library is strict about valid utf-8).
3. The bugsnag exception notifier copied the bytes from the incoming parameter into the JSON exception notification payload (ruby didn't notice because its string library unconditionally believes you if you tell it a string is valid utf8 — another bug :p).
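The same chain of events can be reproduced in miniature in Python. This is an analogy, not the actual Ruby or V8 behavior: the payload below is invented, and it uses the escaped \ud83d form, whereas the real request may have carried raw bytes. Python's json module accepts the lone surrogate, and the problem only surfaces later when something strict touches the string.

```python
import json

# An invented JSON payload whose string ends in a lone leading surrogate.
wire = '{"msg": "truncated \\ud83d"}'

msg = json.loads(wire)["msg"]  # parses without complaint
print(repr(msg[-1]))

try:
    msg.encode("utf-8")        # strict UTF-8 cannot represent a lone surrogate
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# One possible repair: replace anything unencodable with "?" when encoding.
print(msg.encode("utf-8", "replace"))
```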
This same problem manifests with Java as well, where some methods that claim to return UTF-8 on closer inspection actually return “modified UTF-8”, which is broken the same way. Notably I ran across this with the JNI function GetStringUTFChars, but you may also come across it in DataOutputStream's writeUTF etc.
Reminds me of a previous discussion about Go being more "mature" than node.js, where I said having someone like Pike on board gives you more than 30 years of "maturity". I'm pretty sure you wouldn't find that kind of leaky UTF encoding handling in Go.
Well, Node builds atop an established language, while Go is a new development. It's probably easier to build sane Unicode semantics into a new language than to change the JS spec.
Since Rob Pike and Ken Thompson are the guys who came up with UTF-8, you'd expect them to write decent Unicode encoding for Go. It would be surprising if they didn't.