Node's Unicode Dragon | HN Mirror

	Node's Unicode Dragon (cirw.in)
	94 points by foobar2k 4707 days ago

12 comments

stormbrew 4707 days ago

Wish I'd known about this when I was pointing out in another HN thread how utf-16 is a terrible encoding for, among other reasons, pushing the corner case where you find out your encoding/decoding is broken to the very edge of likelihood. It's ridiculous that v8 doesn't properly support utf16, but it's to be expected I suppose.

UTF-8 does not have this problem. That's the way we should be moving.

ender7 4706 days ago

This behavior is actually part of the ECMAScript standard [0], so it's unlikely that V8 (or any other conformant JS engine) would behave the way you (and many others) would want.

JS's treatment of strings is even more wacky than you might think -- it is neither really UCS-2 nor UTF-16. Engines are semi-required to use UTF-16 representations of strings internally, but the API surface that is exposed to the JS code makes them look like UCS-2 strings (i.e. surrogate pairs are not treated as single characters). However, if you stick a JS string into something that is UTF-16 aware, such as a DOM node, then the surrogate pairs will display correctly.

See [1] for a very clear explanation of this muddy subject.

[0] http://www.ecma-international.org/ecma-262/5.1/#sec-8.4

[1] http://mathiasbynens.be/notes/javascript-encoding
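To make the code unit vs. code point distinction concrete, here is a rough Python sketch. It assumes we can model what JS exposes by looking at a string's UTF-16 code units; the helper name `utf16_units` is made up for illustration and is not part of any JS or Python API.

```python
def utf16_units(s):
    """Return the UTF-16 code units of `s` as a list of ints."""
    data = s.encode("utf-16-le")
    # Every UTF-16 code unit is two bytes; read them back as integers.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


dragon = "\U0001F409"            # one code point outside the BMP
units = utf16_units(dragon)

print(len(dragon))               # 1 code point
print([hex(u) for u in units])   # ['0xd83d', '0xdc09'], a surrogate pair
```

A UCS-2-style API reports the second view (length 2); a UTF-16-aware consumer such as a text renderer pairs the two units back into one character.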

stormbrew 4706 days ago

That is all incredibly depressing.

sillysaurus2 4706 days ago

This. Why doesn't everybody use UTF-8? Nobody seems to have any problems with UTF-8. It seems to work almost perfectly, and it's efficient.

est 4706 days ago

Because some of us are pissed that some BMP characters take 3 bytes in UTF-8; that's 50% more storage space than UTF-16 and 50% more time to read/write.

I like the design of Python 3.3 encoding. ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes.

http://www.python.org/dev/peps/pep-0393/
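For anyone who wants to check the byte counts being argued about, a small sketch follows. Note it measures encoded sizes only; it does not inspect the in-memory layout PEP 393 describes, and the sample strings are arbitrary picks of mine.

```python
samples = {
    "ascii": "hello",
    "cjk (BMP)": "\u4e2d\u6587",   # two Han characters
    "astral": "\U0001F409",        # one emoji outside the BMP
}

sizes = {}
for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # the -le variant avoids a 2-byte BOM
    sizes[name] = (utf8, utf16)
    print(f"{name}: {len(text)} code points, {utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

The 50% figure shows up only in the BMP-but-not-ASCII row (6 bytes vs. 4); ASCII goes the other way, and astral characters cost 4 bytes in both encodings.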

deathanatos 4706 days ago

The good point (in my opinion) is not that "ASCII takes 1 byte, BMP takes 2 bytes, everything else 4 bytes", but rather that the exposed API hides this from you, and exposes to you a sequence of code points. This, I hope, will reduce errors, as code points, not code units, are often a better abstraction to be working with. (For some random string processing function.)

So far as I know, Haskell is the only other language that exposes, as the defaultish-native interface, Unicode strings as a sequence or iterable of code points (by just using UTF-32). Java, C#, your-language-here all do code units. C++'s templates are powerful enough that someone could make unicode_str<encoding_to_store_as>, but I've not seen one.

See: http://www.unicode.org/glossary/#code_point http://www.unicode.org/glossary/#code_unit

millstone 4706 days ago

Code points are a better abstraction than code units, but they're still a piss-poor abstraction.

Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true working with code points spares you from that particular problem. But it's also important that you not split combining marks, and zero width joiners, and Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).

An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
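A minimal illustration of the combining-mark case, using Python because its `str` is indexed by code point. The sample word is my own choice; as far as I know the standard library has no UAX #29 grapheme segmenter, so this only shows the failure, not the fix.

```python
import unicodedata

word = "cafe\u0301"        # "café" spelled as e + COMBINING ACUTE ACCENT
print(len(word))           # 5 code points, but 4 user-perceived characters

truncated = word[:4]       # a slice that never splits a surrogate pair...
print(truncated)           # ...yet still drops the accent: "cafe"

# NFC normalization happens to help here because a precomposed é exists,
# but many clusters have no precomposed form, so it is not a general fix.
nfc = unicodedata.normalize("NFC", word)
print(len(nfc))            # 4
```

So code point indexing removes one class of bug (split surrogates) while leaving the substring problem itself unsolved.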

pyre 4706 days ago

This was my reaction too. It's Unicode all the way down... :)

cmccabe 4705 days ago

Go allows you to iterate over a string as a series of code points.

stormbrew 4706 days ago

That's a reasonable replacement for ucs-4 for an internal representation, but it's not actually a character encoding like utf-8 and utf-16 are. It's just a tagged union of several encodings.

As for the inflation issue, 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8 and 2 bytes in utf-16. It tends to even out somewhat. And if you really want your data to be small, gzip will do a better job than either.
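A quick way to try the compression claim yourself is sketched below. It uses `zlib` (the same DEFLATE algorithm gzip wraps) and a deliberately repetitive CJK-heavy sample of my own, so the ratios are far better than real text would give; treat it as a shape-of-the-result demo only.

```python
import zlib

text = "\u6f22\u5b57\u304b\u306a " * 200   # 4 CJK/kana characters + a space, repeated
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print(len(utf8), len(utf16))               # raw sizes: 2600 vs. 2000 bytes

utf8_z = zlib.compress(utf8)
utf16_z = zlib.compress(utf16)
print(len(utf8_z), len(utf16_z))           # compressed sizes, both far smaller
```

On this sample the compressed UTF-8 is much smaller than the uncompressed UTF-16, which is the point being made: once compression is in play, the raw 3-vs-2-byte difference matters far less.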

est 4706 days ago

> 50% is just the absolute worst case. Many kinds of textual data include large amounts of code units that fit in one byte in utf-8

For latin alphabets, yes. For CJK, it's really bad. Things get worse if you dealt with non-BMP before, like iOS emoji, which force you to upgrade MySQL to support utf8mb4, which is totally bullshit. (why the hell do people even presume utf8 is max 3 bytes?)

rspeer 4706 days ago

Because people either don't know anything outside of the BMP exists, or they think astral characters are only for dead languages (they haven't had the dawning realization about emoji yet), or they use a programming language like Java that accidentally implemented CESU-8 and called it "UTF8" a decade and a half ago and isn't allowed to fix it.

One interesting conclusion from looking at the state of Twitter (http://blog.luminoso.com/2013/09/04/emoji-are-more-common-th...) is that CESU-8 is probably more common than real UTF-8.

Another fun thing I ran into today is that Python regular expressions allow astral characters, but you can't safely use them until 3.3 because narrow builds will quietly replace them with nonsense that doesn't run (https://github.com/LuminosoInsight/python-ftfy/commit/86aa65...). And the very reason this came up was in a workaround for a different bug in 3.3.

kps 4706 days ago

    > ... MySQL ...
    > why the hell do people even presume utf8 is max 3 bytes?

I think you answered your own question before you even asked.

rspeer 4706 days ago

Except most text isn't plain text. HTML pages in CJK are still smaller in UTF-8 than in their respective countries' favorite encodings.

sillysaurus2 4706 days ago

Storage space is cheap, and the price continues to fall. Storage of text is virtually nothing. Bandwidth to send text is almost nothing. Also, most text is compressed, which virtually eliminates that concern.

Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.

erichurkman 4706 days ago

> Storage space is cheap, and the price continues to fall.

At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.

est 4706 days ago

> Storage space is cheap

True, but

1. time is precious. For example, you waste 50% more time for a fulltext indexing scan because utf8 is longer.

2. Memory. If you can't hold text in a single machine, you have bigger issues (e.g. clustering algorithms, persistency, redundancy, etc.)

3. Network transfer. If you can save 50% in a db connection rtt, you save a lot.

It makes no sense to save BMP in 3 bytes anyway.

acdha 4706 days ago

You'd have to have rather weird data for it to be anywhere near 50% larger for real text (i.e. even if you only use Chinese, if you have punctuation, arabic numerals, quotes or URLs, HTML, etc. the averages cancel more than you might think) and a completely incompetent search engine design for that to remotely approach 50% more time to query or index.

If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).

pcwalton 4706 days ago

1. Controversy over Han unification made Unicode adoption less universal than might have been hoped.

2. Interoperability with legacy systems that don't use UTF-8 (for example, JavaScript). For example, Rust needs support for the full range of string encodings, because we need that support for implementing a browser engine.

millstone 4706 days ago

Did you read the article? The problem occurs precisely because V8 mishandles UTF-8.

Also check out the bug report: https://code.google.com/p/v8/issues/detail?id=2875

ximeng 4706 days ago

A lot of Windows is UTF-16 or UCS-2, including Office, which forces their use for working with APIs or transferring data.

millstone 4706 days ago

Why do you think that UTF-16's corner cases, by which you presumably mean surrogate pairs, are less likely than UTF-8's corner cases, like invalid code units and non-shortest forms?

I would argue that the UTF-8 corner cases are more rare because they are harder to produce accidentally, and also more serious because they have security implications.
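For reference, these are the kinds of UTF-8 corner cases meant here, checked against Python's strict decoder. The helper name `is_valid_utf8` is mine; the byte sequences are standard textbook examples.

```python
def is_valid_utf8(data):
    """True if `data` decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")   # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"/"))          # True: the shortest form of U+002F
print(is_valid_utf8(b"\xc0\xaf"))   # False: overlong (non-shortest) encoding of "/"
print(is_valid_utf8(b"\xff"))       # False: 0xFF never appears in valid UTF-8
print(is_valid_utf8(b"\xe4\xb8"))   # False: truncated 3-byte sequence
```

The overlong form is the security-relevant one: a lenient decoder that accepts `C0 AF` as "/" can let a path separator slip past a filter that only looked for the byte `2F`.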

justin_vanw 4706 days ago

Man, I'm starting to think there is a cult around JSON.

If you need to accept arbitrary binary data, JSON is a profoundly bad choice. At a minimum, you would expect them to base64 encode the data and put that into a JSON string.

If you are looking at error reports, how is it even remotely acceptable to have them silently modified to include invalid unicode replacement characters?

The lesson here isn't some crappy hack workaround they found, it's a case study in the lengths you'll have to go to when you insist on making technology choices without considering the problem you want to solve.
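A minimal sketch of the base64-in-a-JSON-string approach, assuming a made-up field name `param_b64` and some arbitrary invalid bytes standing in for the un-decodable request parameter:

```python
import base64
import json

raw = b"\xed\xa0\xbd broken \xff bytes"   # not valid UTF-8

# base64 output is plain ASCII, so it is always safe inside a JSON string.
payload = json.dumps({"param_b64": base64.b64encode(raw).decode("ascii")})

# The receiver reverses the two steps and gets the original bytes back.
restored = base64.b64decode(json.loads(payload)["param_b64"])

print(payload)
print(restored == raw)   # True
```

The cost is roughly a third more bytes on the wire, in exchange for a payload that survives any number of strict UTF-8 hops unmodified.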

derefr 4706 days ago

Any wire-serialization format that wants to send arbitrary data should really have a "raw binary payload" type. XML has CDATA. ASN.1 has bitstrings. BERT has Binaries. But JSON doesn't really have anything like that.

I wonder... at some point, Javascript could get a convenient literal syntax for creating pre-filled ArrayBuffers, which would basically be the format JSON would want to adopt. But would it? Are changes to Javascript literal syntax folded into JSON, or is JSON now its own thing that doesn't track JS any more?

Dylan16807 4706 days ago

CDATA disallows null bytes, so it's even worse than non-support: illusory support

XML doesn't even allow escaped null bytes, so you're basically forced to use base64 or weird custom app-internal escapes.

JSON never tracked javascript. It has one version, period. But you could get people to adopt a superset with a new data type, if you kept it simple.

ygra 4706 days ago

Isn't CDATA character data anyway and thus not even binary but in the document's character set? Which makes it a poor choice for binary data even without taking into account that XML forbids certain characters.

As for binary data in web services ... isn't it easier to just use Content-Type for that and use the appropriate type for the payload? That wouldn't require a textual data format that can contain arbitrary binary data.

derefr 4706 days ago

That presumes you want to send a lot of binary data. Sometimes you just have five bytes or so. (E.g., as in the article, an un-decoded string.)

(Still, you're right, I admit to having been able to avoid any work painful enough to teach me XML arcana. I was actually thinking of one of the many variants of "Binary XML" I had read about recently, and assumed the typing was bijective to XML's own types. In other news, BSON of all things has a raw-binary type.)

baddox 4707 days ago

Despite that being a rather interesting technical article, I am upset that my expectation of an actual Unicode depiction of a dragon was not met.

greenyoda 4707 days ago

There is actually a Unicode dragon character at code point U+1F409:

http://www.fileformat.info/info/unicode/char/1f409/index.htm

Also, since any ASCII dragon is also a valid Unicode dragon (in UTF-8, at least), the following might satisfy your needs:

http://www.dougsartgallery.com/ascii-art-dragon.html
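And for the curious, a small Python check of how U+1F409 looks in each encoding (nothing here beyond the standard library):

```python
import unicodedata

dragon = chr(0x1F409)

print(unicodedata.name(dragon))              # DRAGON
print(dragon.encode("utf-8").hex())          # f09f9089 (4 bytes)
print(dragon.encode("utf-16-be").hex())      # d83ddc09 (surrogate pair D83D DC09)
print(dragon.encode("utf-32-be").hex())      # 0001f409
```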

cirwin 4707 days ago

🐉

To see this dragon, either:

1. Use Safari or Firefox on OS X.

2. Install custom fonts for Linux or Windows.

3. Install https://chrome.google.com/webstore/detail/chromoji-emoji-for... for Chrome

pavlov 4707 days ago

The dragon glyph is rendered correctly in IE10 on Windows 8 without any custom fonts. Hooray for the most underestimated browser ever ;)

city41 4707 days ago

Also true of mobile IE10

Wilya 4707 days ago

Next time I have some "Here be dragons" code, I'm going to use this.

lelf 4706 days ago

There is also 🐲 U+1F432 DRAGON FACE

Also: didn't know that for every emoji there is https://en.wikipedia.org/wiki/🐉

nonchalance 4707 days ago

String encoding in general is a mess. Wait till you get to code pages. Incidentally, the largest JS script I've ever seen pertained to encoding and decoding characters under various codepages: https://raw.github.com/Niggler/js-codepage/master/cptable.js [github complains "(Sorry about that, but we can't show files that are this big right now.)"]

jrochkind1 4707 days ago

The OP describes an environment where data goes from node to Rails.

If you want to check a string for valid encoding and/or replace bad bytes with replacement char on the _ruby_ end... it's not very obvious how you do that with the ruby stdlib api, and it takes a few tricks to do right.

So I wrote a gem for it: https://github.com/jrochkind/ensure_valid_encoding
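The gem is Ruby; purely as a language-neutral illustration of the two operations described (detecting invalid bytes, and substituting the replacement character), here is what they look like in Python. This does not show the gem's API, and the sample bytes are made up.

```python
raw = b"abc\xffdef"   # one stray byte that can never occur in UTF-8

# Operation 1: validate. A strict decode either succeeds or raises.
try:
    raw.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False

# Operation 2: repair. Bad bytes become U+FFFD REPLACEMENT CHARACTER.
cleaned = raw.decode("utf-8", errors="replace")

print(valid)            # False
print(repr(cleaned))    # 'abc\ufffddef'
```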

state 4707 days ago

Whew. This explains a bug from six months ago that drove me up the wall. I could never figure it out.

shawnz 4707 days ago

> Unfortunately for us, Javascript has never been updated to support UTF-16. Instead it continues to treat strings as UCS-2.

So really, they were parsing the JSON as if it were UTF-16, but really it was UCS-2. How is that an error in Node?

justincormack 4707 days ago

JSON is defined as UTF-8, UTF-16 or UTF-32 [1]. The escaped characters are UTF-16, not UCS-2. It is unfortunate if JavaScript can't parse it correctly!

[1] http://www.ietf.org/rfc/rfc4627.txt
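To see the UTF-16 nature of JSON escapes, here is a short sketch using Python's `json` module as a stand-in parser (other parsers may be stricter about the lone-surrogate case):

```python
import json

dragon = "\U0001F409"

# Non-BMP characters are escaped as a UTF-16 surrogate pair...
print(json.dumps(dragon))                    # "\ud83d\udc09"

# ...and a conforming parser joins the pair back into one code point.
paired = json.loads('"\\ud83d\\udc09"')
print(paired == dragon)                      # True

# Half a pair is still accepted by this parser, but the result is not
# a valid Unicode scalar value and cannot be encoded as UTF-8.
lone = json.loads('"\\ud83d"')
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                             # False
```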

kansface 4707 days ago

This is true of JSON, but it's not true of Javascript which gives no fucks about utf16 (or valid surrogate pairs). It's a very strange world where JSON and Javascript have incompatible interpretations of strings.

http://mathiasbynens.be/notes/javascript-encoding

gnaritas 4707 days ago

Not really as JSON is not valid JavaScript and requires its own parser. It's based on JavaScript, but it is not JavaScript.

daxelrod 4706 days ago

I was skeptical, but I did some searching, and you appear to be right! The difference seems to come down to string handling:

http://timelessrepo.com/json-isnt-a-javascript-subset

gnaritas 4706 days ago

Ha, same article where I first learned this.

kansface 4707 days ago

They wanted to parse some bytes as utf-16, but are unable to do so because V8 only understands ucs2 (with invalid surrogate pairs). This is a major problem with node, i.e., it happily produces/consumes invalid unicode encoded strings.

dsj36 4707 days ago

how did the error JSON include the undecodable bytes? JSON strings are all unicode sequences, so there would have had to be some way that the raw bytes were mapped into codepoints.

on the other hand, if the offending bytes were blindly substituted into the JSON, then it's not surprising that there were decoding issues down the line...

jlarocco 4707 days ago

From the article:

> The exceptions that were crashing us were caused by people using String.prototype.substr. That function works perfectly on strings that only contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2 string there's a possibility that when you take a slice you'll split a valid surrogate pair into two invalid lonely surrogates.

To me, it seems like it'd be nearly impossible for somebody to trigger, but there's always Murphy's law...

twoodfin 4706 days ago

These kinds of isolated surrogates are pretty easy to create if you're doing the right kind of processing on the right kind of data.

Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS String, then start processing it in fixed size chunks. If your source text contains any significant percentage of surrogate pair-represented characters, you'll eventually break one.
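The fixed-size chunking scenario can be simulated outside JS. The sketch below models a JS string as UTF-16 code units in Python; `utf16_chunks` and `is_well_formed` are hypothetical helper names, and the chunk size of 3 is picked so the split lands inside the pair.

```python
def utf16_chunks(s, size):
    """Split `s` every `size` UTF-16 code units, the way substr would."""
    data = s.encode("utf-16-le")
    step = size * 2  # two bytes per code unit
    return [data[i:i + step] for i in range(0, len(data), step)]


def is_well_formed(chunk):
    """True if `chunk` is valid UTF-16 (no lonely surrogates)."""
    try:
        chunk.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


text = "ab\U0001F409cd"            # 5 code points, 6 UTF-16 code units
chunks = utf16_chunks(text, 3)

for chunk in chunks:
    print(chunk.hex(), is_well_formed(chunk))
```

Here the boundary falls between D83D and DC09, so the first chunk ends with a lonely leading surrogate and the second starts with a lonely trailing one; neither is valid on its own even though the whole string is.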

cirwin 4707 days ago

In the example I looked at to debug this, the sequence of events was:

1. One of our customer's javascript apps sent a truncated string to their web-server in a JSON payload. This string ended with a leading surrogate (this is another instance of the V8 bug discussed in the blog post).

2. Their ruby backend exploded when they tried to use a regular expression on the string (because ruby's regexp library is strict about valid utf-8).

3. The bugsnag exception notifier copied the bytes from the incoming parameter into the JSON exception notification payload (ruby didn't notice because its string library unconditionally believes you if you tell it a string is valid utf8 — another bug :p).
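The middle of that chain can be reproduced in a few lines. This sketch assumes the lone leading surrogate reaches the wire as the three-byte sequence a non-strict encoder emits (that is my assumption for illustration, not something confirmed above), and uses Python's strict decoder in place of the Ruby regexp library.

```python
truncated = "hi \ud83d"   # a string cut off after a leading surrogate

# A permissive encoder writes the surrogate out as if it were a normal
# BMP code point, producing bytes that are not valid UTF-8.
wire = truncated.encode("utf-8", errors="surrogatepass")
print(wire)               # b'hi \xed\xa0\xbd'

# Anything strict about UTF-8 then fails on those bytes.
try:
    wire.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)           # False
```

Any component that copies `wire` along without validating it (step 3) just moves the failure to the next strict consumer.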

sujayakar 4707 days ago

ah yeah step 3 seems pretty bad -- cool that you found that bug!

scoopr 4706 days ago

This same problem manifests with Java as well, where some methods that claim to return UTF-8 on closer inspection actually return “modified UTF-8”, which is broken the same way. Notably I ran across this with the JNI function GetStringUTFChars, but you may also come across it in DataOutputStream's writeUTF etc.
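For comparison, here is roughly what the surrogate-wise encoding looks like next to real UTF-8. The function `cesu8_encode` is a hypothetical helper written for this example; it covers only the CESU-8 part and ignores the other quirk of Java's modified UTF-8 (encoding U+0000 as two bytes).

```python
def cesu8_encode(s):
    """Encode `s` CESU-8 style: UTF-8 applied to each UTF-16 code unit."""
    utf16 = s.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = chr(int.from_bytes(utf16[i:i + 2], "big"))
        # surrogatepass lets each half of a pair be written as 3 bytes.
        out += unit.encode("utf-8", errors="surrogatepass")
    return bytes(out)


dragon = "\U0001F409"
real = dragon.encode("utf-8")
cesu = cesu8_encode(dragon)

print(real.hex())   # f09f9089     (4 bytes, real UTF-8)
print(cesu.hex())   # eda0bdedb089 (6 bytes, one 3-byte sequence per surrogate)
```

For BMP-only text the two encodings are byte-identical, which is why the difference tends to go unnoticed until the first emoji shows up and a strict UTF-8 decoder rejects the six-byte form.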

bsaul 4706 days ago

Reminds me of a previous discussion about Go being more "mature" than node.js, where I said having someone like Pike on board gives you more than 30 years of "maturity". I'm pretty sure you wouldn't find that kind of leaky UTF encoding handling in Go.

ygra 4706 days ago

Well, Node builds atop an established language, while Go is a new development. It's probably easier to build sane Unicode semantics into a new language than to change the JS spec.

pjscott 4706 days ago

Since Rob Pike and Ken Thompson are the guys who came up with UTF-8, you'd expect them to write decent Unicode encoding for Go. It would be surprising if they didn't.

scott_karana 4707 days ago

Is it just me, or is the two-column layout a bit tricky for readability?

(1440x900)

oceanstone 4707 days ago

I can't believe NodeJS doesn't support Dragon symbols. This is a dealbreaker.