Parsing malformed JSON

Hacker News

	Parsing malformed JSON (peteris.rocks)
	66 points by p8donald 3504 days ago

22 comments

bhaak 3504 days ago

Great, after the tag soup of modern browsers are we now also going to see json soup?

Sometimes it's obvious what's wrong with malformed data you receive. A classic would be encoding errors.

But as soon as you start supporting broken components and APIs, you will never be able to drop support for them.

Prime example would be HTML. Granted, in the beginning, it was supposed to be written by humans but that was rather quickly not a major obstacle anymore and even a human can produce valid HTML with the help of a syntax checker.

drakenot 3504 days ago

I've written a relatively popular Atom/RSS feed parser for Go [0].

I struggled with this very issue but I ultimately ended up attempting to be robust against out-of-spec feeds. A super strict feed parsing library is less useful than one that can successfully parse certain classes of broken feeds.

It is a fine line to walk -- I won't add a great deal of complexity to support overly broken feeds, but if it is relatively simple to support certain types of common mistakes I'll do it.

[0] https://github.com/mmcdole/gofeed

treve 3504 days ago

I'm doing this with WebDAV too. When I come across a bug that's clearly an implementation problem I weigh how prevalent the software is, how likely they will be able to fix it and if possible I add a user-agent specific workaround so new clients can't rely on the same bug with my server.

kr0 3504 days ago

But then we add the IE nightmare of using an accepted user-agent in a new product to work around cases like this.

treve 3503 days ago

That nightmare had to do with misbehaving servers. IE had to advertise as Mozilla so servers would serve the better response.

In this case it would be possible for a client to fake a UA, but it's more likely that they weren't aware they were doing things incorrectly and will correct the behavior rather than opting in to mimicking a different UA to get the server to behave in a non-standard way.

I haven't seen this happen, and this is one of the most popular DAV implementations. I have seen people fix broken implementations as I've slowly been making the server more strict over the last 10 years.

markrages 3504 days ago

Nothing new under the sun:

http://www.xml.com/pub/a/2003/01/22/dive-into-xml.html

drakenot 3504 days ago

I read those threads while I was first starting to write my parser.

I found it interesting, if you look in the thread you'll see that this was a big disagreement between Pilgrim and Aaron Swartz.

bhaak 3504 days ago

I know that, a long time ago I wrote an HTML parser that tried to make the most sense out of any HTML you threw at it. At one point, it was used to parse most of the Chinese websites there were at the time to find neologisms.

So it was pretty robust but yeah, somewhere you should draw the line.

I think, as long as it doesn't compromise the design of your program (for example, parsing rfc822 dates with localized weekdays) it's fine to be a bit lenient in what you accept.

Anything that goes beyond that needs a very good reason.

peterkelly 3504 days ago

Please don't do things like this. It only encourages people to be lazy about producing conforming documents, and different parsers that try to compensate for syntax errors are going to do so in different ways. We learnt this the hard way with HTML.

mwkaufma 3504 days ago

Exactly. When you design for sloppy inputs, you're vulnerable to malicious inputs.

ludamad 3504 days ago

I'm not quite seeing who you think would be encouraged here. Bad JSON output is usually created in a rush by someone who didn't test their output. It's unlikely that someone who does test their JSON output would become lazy because a few lenient parsers exist.

hueving 3504 days ago

Once there are parsers accepting bad input, people will inevitably test with those parsers and assume their output is okay.

captainmuon 3504 days ago

Many people here are wondering how you can end up with JSON this bad, and who is "sending" it to them. Well, the poster is not necessarily running a REST service. At work, I've dealt with plenty of little JSON (and XML) files, created by "little tools" and passed around via files and pipes. Since I work in science, most of our coders are the users of their code, so you can imagine both code quality and UX are poor. And the main reason something like this happens is that people don't use proper serialization, because they never heard of it, or they don't have the right tools. They just construct JSON by string interpolation. If they are lucky, they remember to replace `'` with `\"`. In fact, that looks a lot like what happened here (plus one or two levels of escaping).
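
To make the string interpolation problem concrete, here is a minimal Python sketch. The `name` value and the `is_valid_json` helper are made up for illustration; only `json.dumps` and `json.loads` come from the standard library.

```python
import json

# Hypothetical field value that happens to contain double quotes.
name = 'Dwayne "The Rock" Johnson'

# Hand-built JSON via string interpolation: the inner quotes are not escaped.
broken = '{"name": "%s"}' % name

# Proper serialization escapes quotes, backslashes and control characters.
good = json.dumps({"name": name})


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(broken))  # False
print(is_valid_json(good))    # True
```

The interpolated version only works as long as no value ever contains a quote, a backslash or a control character, which is exactly the kind of thing nobody tests in a small tool.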

Apropos escaping, people are most likely to do this if they never wrote PHP websites as kids and never went through `urlencode` and `mysql_escape_string` hell.

RMarcus 3504 days ago

I wrote a library to handle (many cases) of invalid JSON, motivated by a similar experience. https://github.com/RyanMarcus/dirty-json

I'm on my phone now, but later today I'll test to see if it would have worked for the author. It's good for cleaning up JSON, but I would be wary of putting it (or anything like it) anywhere near production.

k2xl 3504 days ago

I'm hoping nobody actually does this in production. As an academic exercise it is interesting.

Maybe I'm old fashioned - I'm all for flexible APIs and all, but only to a point. If a customer sends rotten stuff, it should just be rejected with a 40x code.

At minimum, check to make sure it is proper JSON... I know that a lot of stream processors will put it into a queue and 200 right away and then process in the background, but I don't think that ensuring it is at least JSON and doesn't have a content size of more than X could be too intensive.
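
A rough sketch of that minimum check in Python, assuming a hypothetical `MAX_BODY_BYTES` limit (the "X" here) and a made-up `check_body` helper that returns an HTTP-style status code:

```python
import json

# Hypothetical size limit; pick whatever "X" fits the service.
MAX_BODY_BYTES = 1_000_000


def check_body(body: bytes) -> int:
    """Return an HTTP-style status code for a raw request body."""
    if len(body) > MAX_BODY_BYTES:
        return 413  # Payload Too Large
    try:
        json.loads(body)
    except ValueError:
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return 400  # Bad Request
    return 200
```

Note that this parses the whole body once just to validate it, so the size check has to come first to keep the cost bounded.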

In this case, if the data was already accepted and you've got no choice but to deal with it, you've gotta do what you got to do. I've been there, and it ain't fun cleaning up a 900 GB JSON file.

amyjess 3504 days ago

> Maybe I'm old fashioned - I'm all for flexible APIs and all, but only to a point. If a customer sends rotten stuff, it should just be rejected with a 40x code.

In some fields, that's not an option. I do NMS engineering. If I need to set up monitoring for something, and the only source of the diagnostics I need is an endpoint that returns malformed JSON, I can't just throw my hands up and say "the data's in a shit format, I won't touch it". I'll have no choice but to get my hands dirty and parse out whatever I can because our systems need to be monitored.

I'm lucky in that the only times I had to deal with malformed JSON at this job, I was able to fix the program that was generating it because it was maintained by my team (the problem was that it was snarfing data from a database and sending it out as JSON but forgetting to escape tab characters, and unescaped tabs aren't allowed in JSON), but my luck's gonna run out some day.
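
For anyone curious what the tab problem looks like, a small Python sketch (the `raw` document is invented for illustration). The standard `json` module rejects a literal tab inside a string by default, and `strict=False` is the documented way to let control characters through:

```python
import json

# The \t in this Python literal is a real tab character inside the JSON string.
raw = '{"note": "col1\tcol2"}'

try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False  # the default strict decoder refuses the raw tab

# strict=False allows control characters inside strings.
lenient = json.loads(raw, strict=False)

# A correct producer emits the two-character escape sequence instead.
escaped = json.dumps({"note": "col1\tcol2"})

print(strict_ok)  # False
print(escaped)    # {"note": "col1\tcol2"}
```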

junke 3504 days ago

I don't deal with such huge files. Honestly, what use case requires 900GB of JSON?

hyperman1 3504 days ago

I got one for you. We have to upload JSON files that contain, for a bunch of articles, some encoded rules plus the legal text explaining why the encoded rules are what they are.

The law part was supposed to be a few lines of text. Except when they don't know which article to give. In that case they provide the full law text, including scanned PDFs, base64 encoded. All 2GB of it. Basically you have something with the meaning null, encoded in a huge string.

Now the creation of this file was given to a third party, who don't bother with finding out the relevant law, and paste the 2GB blob into every article they modify, just to be sure. At this point we have 500 000 articles in that file. We get a new one every month.

Not fun at all. But it is modern, at least, in the past it was a cobol flat file.

junke 3504 days ago

This looks like TheDailyWTF.com, but thanks.

devy 3504 days ago

This reminds me of the "Parsing JSON is a minefield" post from a few weeks ago. TL;DR: JSON is not standardized (or rather, it has multiple standards), which makes parsing / validating JSON data very tricky in edge cases.

https://news.ycombinator.com/item?id=12796556

beejiu 3504 days ago

What are the multiple standards of JSON? I am only aware of one standard; it is the implementations that are the problem.

devy 3504 days ago

Read the original article that I linked in the HN discussion, skip through to the section where it says "Yet JSON is defined in at least six different documents". You're welcome.

Analemma_ 3504 days ago

How well do you know the sender? Because this looks like an attack, or at least a probe: something to try and crash the parser and see what response they get back, to see if you are vulnerable to some kind of heap corruption attack.

aikah 3504 days ago

> I have no idea how something like this was generated.

It would be interesting to ask the sender how.

> If the file is small enough or the data regular enough, you could fix it by hand with some search & replace.

Of course.

> But the file I had was gigabytes in size and most of it looked fine.

I suspect a faulty JSON library, it's important to figure out how it was generated so an eventual issue can be opened and the bug can be fixed.

junke 3504 days ago

> I had this "JSON" file sent to me

Why? By whom? Did you complain loudly?

PaulHoule 3504 days ago

Malformed data is a scalability problem. Unusual failure modes from coding problems to random bit flips become inevitable as the data volume approaches infinity.

thwd 3504 days ago

Agreed, but the payload from the article doesn't seem to have suffered from astral radiation. Rather, random attempts at quote-escaping by someone who doesn't understand what they're doing. Also notice the "nan" value -- JSON has no concept of NaN.

colanderman 3504 days ago

Yes, and that's a problem to be solved at the transport and storage layers, not the application layer.

reikonomusha 3504 days ago

But to be clear, error correction should be done at a level far lower than the parsing stage. It's usually a property of the storage medium or the firmware that accesses it.

If you have to correct for bit flips when you begin to read or parse data, it's too late.
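
As an illustration of keeping integrity checks out of the parser, here is a minimal Python sketch that assumes the sender ships a SHA-256 digest alongside the payload (that arrangement and the `load_verified` name are assumptions, not something from the article). It only detects corruption; actual correction would still belong to the storage or transport layer:

```python
import hashlib
import json


def load_verified(payload: bytes, expected_sha256: str):
    """Parse payload only if its SHA-256 digest matches the expected one."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256:
        # Refuse to parse at all instead of guessing what the bytes meant.
        raise ValueError("checksum mismatch: payload is corrupted")
    return json.loads(payload)
```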

mSparks 3504 days ago

my first reaction would be to parse until you hit a problem. then use a string distance function and a genetic algorithm to find the problematic characters.

in other words, find multiple possibilities that result in a valid json object and choose the one with the shortest distance.

then, of course log out the changes.
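
a much simplified sketch of that idea in Python. it swaps the genetic algorithm for a brute-force search over single-character deletions and insertions near the reported error position, so every candidate is at edit distance 1. the function name, the `window` size and the set of characters tried are all made up for illustration:

```python
import json


def repair_single_edit(text, window=3, inserts='",:}]'):
    """Try to make text parse by changing one character near the error.

    Returns (parsed_value, description_of_change) or raises ValueError.
    """
    try:
        return json.loads(text), "no change"
    except json.JSONDecodeError as err:
        pos = err.pos  # index where the standard parser gave up

    lo = max(0, pos - window)
    hi = min(len(text), pos + window)
    candidates = []
    for i in range(lo, hi + 1):
        if i < len(text):
            # Candidate: delete the character at index i.
            candidates.append((text[:i] + text[i + 1:], f"deleted {text[i]!r} at {i}"))
        for ch in inserts:
            # Candidate: insert one structural character before index i.
            candidates.append((text[:i] + ch + text[i:], f"inserted {ch!r} at {i}"))

    for candidate, change in candidates:
        try:
            return json.loads(candidate), change
        except json.JSONDecodeError:
            continue
    raise ValueError("no single-character fix found near position %d" % pos)


value, change = repair_single_edit('{"a": 1,, "b": 2}')
print(value)   # {'a': 1, 'b': 2}
print(change)  # this is the part you log
```

it returns the first candidate that parses, which is still a guess: two different one-character fixes can both yield valid json with different meanings.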

I do something similar with csvs. mssql is notorious for spitting out junk inside csv files.

also, i can guess how it was created.

the code is probably in c, and a rare edge case is overwriting memory before it hits the file.

wccrawford 3504 days ago

It's a neat trick, but not something I'd deploy into production. If I have to try to guess at what the customer is sending me, I'm not going to apply it to their account.

In an emergency, I might hand-edit it and make it right, but I'd absolutely insist that further files be in the correct format.

k__ 3504 days ago

Isn't this used mainly in editors that want to provide some hints even for JSON you haven't finished yet?

wccrawford 3503 days ago

That's a legit use for it, sure. But when a non-techie sees something like this, they immediately think of all the hassle they can save a customer that is having trouble making valid JSON. "We'll just parse it for them!" They completely ignore that it's not possible to know for sure what the customer really wanted, and it's the start of a lot of headaches.

mwkaufma 3504 days ago

Or, how I made my service a DDoS target.

It's not just the extra compute, it's the lack of a formal specification. If different services apply this kind of ad hoc "Postel's principle", they may parse the malformed markup differently and end up introducing downstream inconsistencies.

hueving 3504 days ago

Or even vulnerabilities. Imagine a scenario where a parser for an authentication engine reads a different value for a given key than the value the authorization logic reads.
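
One way this can happen even with well-formed input is duplicate keys. A small Python sketch, where the payload and the "first wins" parser are hypothetical stand-ins for two components that disagree (Python's own `json` module keeps the last value for a repeated key):

```python
import json

# Invented payload with the same key twice.
payload = '{"user": "alice", "user": "admin"}'


def first_wins(pairs):
    """Build a dict that keeps the first occurrence of each key."""
    result = {}
    for key, value in pairs:
        result.setdefault(key, value)
    return result


# Hypothetical authentication component whose parser keeps the first value.
seen_by_authentication = json.loads(payload, object_pairs_hook=first_wins)
# Authorization logic using Python's default behaviour: the last value wins.
seen_by_authorization = json.loads(payload)


def reject_duplicates(pairs):
    """Stricter hook: refuse objects that repeat a key."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in object")
    return dict(pairs)


print(seen_by_authentication["user"])  # alice
print(seen_by_authorization["user"])   # admin
```

Passing `object_pairs_hook=reject_duplicates` to `json.loads` is one way to make such input fail loudly instead of being interpreted two different ways.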

brassic 3504 days ago

This isn't theoretical, I've seen it with HTTP, HTML and elsewhere. Any time two pieces of software disagree on how to parse a chunk of data, especially if one of them is supposed to be doing some sort of security check, you should expect to find a vulnerability lurking.

I don't know if there's a name for this class of problem. I'd be interested to know.

latch 3504 days ago

I'm not a Python developer, so I was surprised to find that the built-in json library has a flag allow_nan which is True by default.
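
A short sketch of what that flag does, using only the standard `json` module (the `reject_constant` helper is made up for the decoding side):

```python
import json

nan = float("nan")

# Default behaviour: the encoder emits the bare token NaN, which is not valid JSON.
lenient_output = json.dumps({"value": nan})

# With allow_nan=False the encoder refuses instead.
try:
    json.dumps({"value": nan}, allow_nan=False)
    strict_raised = False
except ValueError:
    strict_raised = True


def reject_constant(name):
    """Hypothetical hook that turns non-standard constants into errors."""
    raise ValueError("non-standard JSON constant: " + name)


# When decoding, parse_constant is called for NaN, Infinity and -Infinity.
try:
    json.loads('{"value": NaN}', parse_constant=reject_constant)
    decode_raised = False
except ValueError:
    decode_raised = True

print(lenient_output)  # {"value": NaN}
print(strict_raised)   # True
print(decode_raised)   # True
```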

Also, not invalid, but surprising / annoying (it took a while to debug): an empty Lua table is the same as an empty Lua array, {}. This causes ambiguity.

    local cjson = require("cjson")
    -- will print {}
    print(cjson.encode(cjson.decode('[]')))

amyjess 3504 days ago

Another nice feature of the built-in JSON library is that you can choose what class to instantiate with the data. The default is a dict, but if you're trying to parse Avro records (or something else that cares about field order), you can change that to an OrderedDict.
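
A minimal sketch of that, with an invented `text` document. The `object_pairs_hook` parameter receives the key/value pairs in document order:

```python
import json
from collections import OrderedDict

# Invented record whose field order matters.
text = '{"zebra": 1, "apple": 2, "mango": 3}'

# object_pairs_hook is called with a list of (key, value) pairs for each object.
record = json.loads(text, object_pairs_hook=OrderedDict)

print(type(record).__name__)  # OrderedDict
print(list(record))           # ['zebra', 'apple', 'mango']
```

Since Python 3.7 plain dicts also preserve insertion order, so the hook matters mostly when you want a specific class rather than just ordering.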

anentropic 3504 days ago

just send the file back where it came from

nommm-nommm 3504 days ago

Why would a JSON file be GBs in size? I think that's the more interesting question.

nilved 3504 days ago

Because it has GBs of data? There's no size limit on JSON.

mertd 3504 days ago

I think nommm-nommm is trying to imply that if you're passing GBs of JSON around, "human readability" probably isn't a concern. Therefore you could go for an efficient binary format.

ludamad 3504 days ago

JSON isn't just about human readability, it's about being a 'good enough' standard for data exchange. What binary format would you use that people could parse as reliably as JSON?

ezrast 3503 days ago

In case your question wasn't rhetorical, I believe MessagePack is the leading schema-less binary serialization format (which does not contradict your point as it is still less ubiquitous than JSON).

bootload 3504 days ago

"Why would a JSON file be GBs in size?"

Maps. [0] Geo-coordinate data can consist of tens of thousands of data points. For example, think of a two dimensional space with a coordinate grid at regular intervals representing a 20km x 20km city. Then imagine creating an outline of a city road network. Each point is a LAT/LON coordinate. Then imagine placing thematic data such as known traffic hot spots.

Lots of data.

[0] The author has this post in his blog ~ https://peteris.rocks/blog/openstreetmap-city-blocks-as-geoj...

nommm-nommm 3503 days ago

Thank you! I was really racking my brain trying to think of a use case that produced that much JSON.

agounaris 3504 days ago

Wasn't it easier to just remove the wrong characters manually? :P

Validate the JSON and if it's wrong just throw it away. It makes no sense trying to fix/guess the correct form of an input.

nkrisc 3504 days ago

Should you really assume malformed JSON is even correct?

ekiara 3503 days ago

Wouldn't a better option be an error log? You reply to the client: "I can accept 398,500 of your 400,000 submitted records; attached are the records that do not conform to the expected template. Choose either to (1) submit only the validated records and discard the malformed ones or (2) reformat the malformed records and resubmit the entire batch."
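
A rough Python sketch of that accept/reject split, assuming the batch arrives as one JSON record per line (that layout and the `split_records` name are assumptions for illustration):

```python
import json


def split_records(lines):
    """Separate newline-delimited JSON records into accepted and rejected."""
    accepted, rejected = [], []
    for number, line in enumerate(lines, start=1):
        try:
            accepted.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Keep the line number, the raw text and the parser's message
            # so the sender can see exactly what to fix.
            rejected.append((number, line, err.msg))
    return accepted, rejected


accepted, rejected = split_records(['{"id": 1}', '{"id": 2', '{"id": 3}'])
print(len(accepted), len(rejected))  # 2 1
```

The rejected list is the error log: it can be sent back as-is, and nothing in it has been guessed at or silently repaired.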

ape4 3504 days ago

JSON should have a nicer way of dealing with double quotes in data. That would avoid many encoding mistakes.

JadeNB 3504 days ago

> JSON should have a nicer way of dealing with double quotes in data. That would avoid many encoding mistakes.

So you update the standard to this nicer way of dealing with double quotes, and now people forget to indicate whether they're using the nice new way or the ugly old way, or they mix the two approaches ….

ape4 3502 days ago

It would have to be phased in ... like html5 or any browser improvement.

fbreduc 3504 days ago

i don't get malformed json, i tell the sender to re-send data as json

bborud 3504 days ago

Don't.
