| HN Mirror



	by dsj36 4708 days ago
	How did the error JSON include the undecodable bytes? JSON strings are sequences of Unicode code points, so the raw bytes would have had to be mapped into code points somehow. On the other hand, if the offending bytes were blindly substituted into the JSON, then it's not surprising that there were decoding issues down the line...

2 comments

jlarocco 4708 days ago

From the article:

> The exceptions that were crashing us were caused by people using String.prototype.substr. That function works perfectly on strings that only contain Unicode 1.0 data, but as soon as you're storing UTF-16 in your UCS-2 string there's a possibility that when you take a slice you'll split a valid surrogate pair into two invalid lonely surrogates.

To me, it seems like it'd be nearly impossible for somebody to trigger, but there's always Murphy's law...
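For illustration, here is a minimal sketch of the split the article describes. The sample string and variable names are made up for this example; the only assumption is that the text contains a character outside the Basic Multilingual Plane, which JavaScript stores as two UTF-16 code units.

```javascript
// Hypothetical sample: "a", then U+1F600 (stored as the surrogate pair
// 0xD83D 0xDE00), then "b". Three characters, four UTF-16 code units.
const text = "a\u{1F600}b";

// substr counts code units, not characters, so a length of 2 keeps "a"
// plus only the first half of the pair.
const cut = text.substr(0, 2);

// The last code unit of the result is a leading surrogate with no partner.
const lastUnit = cut.charCodeAt(cut.length - 1);

console.log(text.length);           // 4
console.log(cut.length);            // 2
console.log(lastUnit.toString(16)); // d83d
```

Any call that cuts at a code-unit offset (substr, substring, slice) behaves the same way; nothing here is specific to substr.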


twoodfin 4707 days ago

These kinds of isolated surrogates are pretty easy to create if you're doing the right kind of processing on the right kind of data.

Suppose you receive a long piece of text wrapped in JSON, unpack it into a JS String, then start processing it in fixed-size chunks. If your source text contains any significant percentage of characters represented as surrogate pairs, you'll eventually break one.
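A rough sketch of that chunking scenario, assuming chunk sizes are measured in UTF-16 code units. Both function names are hypothetical; `pairSafeChunks` is just one possible workaround (letting a chunk grow by one unit rather than cutting a pair), not something proposed in the thread.

```javascript
// Naive chunking: cuts every `size` code units and ignores surrogate pairs.
function naiveChunks(str, size) {
  const chunks = [];
  for (let i = 0; i < str.length; i += size) {
    chunks.push(str.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical safer variant: if a chunk would end on a leading surrogate
// (0xD800..0xDBFF) whose trailing partner (0xDC00..0xDFFF) comes next,
// extend the chunk by one code unit so the pair stays together.
// Assumes size >= 1.
function pairSafeChunks(str, size) {
  const chunks = [];
  let i = 0;
  while (i < str.length) {
    let end = Math.min(i + size, str.length);
    const last = str.charCodeAt(end - 1);
    if (end < str.length && last >= 0xd800 && last <= 0xdbff) {
      const next = str.charCodeAt(end);
      if (next >= 0xdc00 && next <= 0xdfff) end += 1;
    }
    chunks.push(str.slice(i, end));
    i = end;
  }
  return chunks;
}

// "ab", U+1F600, "cd": six code units, with the pair straddling offset 3.
const sample = "ab\u{1F600}cd";
const broken = naiveChunks(sample, 3);  // first chunk ends in a lone 0xD83D
const intact = pairSafeChunks(sample, 3); // first chunk is 4 units long
```

The trade-off is that chunks are no longer strictly fixed-size, which may or may not be acceptable depending on what the chunks are used for.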


cirwin 4708 days ago

In the example I looked at to debug this, the sequence of events was:

1. One of our customers' JavaScript apps sent a truncated string to their web server in a JSON payload. This string ended with a leading surrogate (this is another instance of the V8 bug discussed in the blog post).

2. Their Ruby backend exploded when they tried to use a regular expression on the string (because Ruby's regexp library is strict about valid UTF-8).

3. The bugsnag exception notifier copied the bytes from the incoming parameter into the JSON exception notification payload (Ruby didn't notice because its string library unconditionally believes you if you tell it a string is valid UTF-8; another bug :p).
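As a sketch of how the string in step 1 could be caught on the client before it is sent, here is a small check for unpaired surrogates. The function name `findLoneSurrogate` and the sample strings are hypothetical; this is not what the customer's app or the notifier actually does.

```javascript
// Returns the index of the first unpaired surrogate code unit in `str`,
// or -1 if every surrogate is part of a valid leading/trailing pair.
function findLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Leading surrogate: valid only if a trailing surrogate follows.
      // charCodeAt past the end yields NaN, which fails both comparisons.
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the trailing half of a valid pair
        continue;
      }
      return i;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Trailing surrogate with no leading surrogate before it.
      return i;
    }
  }
  return -1;
}

// Hypothetical payload value truncated in the middle of a pair,
// ending with a leading surrogate as in step 1.
const truncated = "truncated\uD83D";
const wellFormed = "ok\u{1F600}";

console.log(findLoneSurrogate(truncated));  // 9
console.log(findLoneSurrogate(wellFormed)); // -1
```

Newer JavaScript engines also ship a built-in `String.prototype.isWellFormed()` for the same yes/no question; the manual loop is used here only so the sketch does not depend on engine version.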


sujayakar 4708 days ago

ah yeah step 3 seems pretty bad -- cool that you found that bug!
