Great, after the tag soup of modern browsers, are we now also going to see JSON soup?
Sometimes it's obvious what's wrong with malformed data you receive. A classic would be encoding errors.
But as soon as you start supporting broken components and APIs, you will never be able to unsupport them.
Prime example would be HTML. Granted, in the beginning, it was supposed to be written by humans but that was rather quickly not a major obstacle anymore and even a human can produce valid HTML with the help of a syntax checker.
I've written a relatively popular Atom/RSS feed parser for Go [0].
I struggled with this very issue but I ultimately ended up attempting to be robust against out-of-spec feeds. A super strict feed parsing library is less useful than one that can successfully parse certain classes of broken feeds.
It is a fine line to walk -- I won't add a great deal of complexity to support overly broken feeds, but if it is relatively simple to support certain types of common mistakes I'll do it.
I'm doing this with WebDAV too. When I come across a bug that's clearly an implementation problem I weigh how prevalent the software is, how likely they will be able to fix it and if possible I add a user-agent specific workaround so new clients can't rely on the same bug with my server.
That nightmare had to do with misbehaving servers. IE had to advertise as Mozilla so servers would serve the better response.
In this case it would be possible for a client to fake a UA, but it's more likely that they weren't aware they were doing things incorrectly and will correct the behavior, rather than opting in to mimicking a different UA to get the server to behave in a non-standard way.
I haven't seen this happen, and this is one of the most popular DAV implementations. I have seen people fix broken implementations as I've slowly been making the server more strict over the last 10 years.
I know that, a long time ago I wrote an HTML parser that tried to make the most sense out of any HTML you threw at it. At one point, it was used to parse most of the Chinese websites there were at the time to find neologisms.
So it was pretty robust but yeah, somewhere you should draw the line.
I think, as long as it doesn't compromise the design of your program (for example, parsing rfc822 dates with localized weekdays) it's fine to be a bit lenient in what you accept.
Anything that goes beyond that needs a very good reason.
Please don't do things like this. It only encourages people to be lazy about producing conforming documents, and different parsers that try to compensate for syntax errors are going to do so in different ways. We learnt this the hard way with HTML.
I'm not quite seeing who you think would be encouraged here. Bad JSON output is usually created in a rush by someone who didn't test their output. It's unlikely that someone who does test their JSON output would become lazy because a few lenient parsers exist.
Many people here are wondering how you can end up with JSON this bad, and who is "sending" it to them. Well, the poster is not necessarily running a REST service. At work, I've dealt with plenty of little JSON (and XML) files, created by "little tools" and passed around via files and pipes. Since I work in science, most of our coders are the users of their code, so you can imagine both code quality and UX are poor. And the main reason something like this happens is that people don't use proper serialization, because they never heard of it, or they don't have the right tools. They just construct JSON by string interpolation. If they are lucky, they remember to replace `'` with `\"`. In fact, that looks a lot like what happened here (plus one or two levels of escaping).
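To make the interpolation problem concrete, here is a minimal Python sketch. The `name` value and the template string are made up for illustration; the only point is that hand-built JSON breaks as soon as the data contains a quote, while `json.dumps` handles the escaping.

```python
import json

# Hypothetical value containing both kinds of quotes.
name = 'Dr. "Bob" O\'Neil'

# Naive approach: build the JSON text by string interpolation.
naive = '{"name": "%s"}' % name

try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    # The embedded double quotes end the string early, so parsing fails.
    naive_ok = False

# Proper serialization: the library escapes quotes and control characters.
proper = json.dumps({"name": name})
round_tripped = json.loads(proper)["name"]
```

The serialized form round-trips back to the original string, which is exactly what the interpolated version cannot guarantee.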
Apropos escaping, people are most likely to do this if they never wrote PHP websites as kids and never went through `urlencode` and `mysql_escape_string` hell.
I'm on my phone now, but later today I'll test to see if it would have worked for the author. It's good for cleaning up JSON, but I would be wary of putting it (or anything like it) anywhere near production.
I'm hoping nobody actually does this in production. As an academic exercise it is interesting.
Maybe I'm old-fashioned - I'm all for flexible APIs and all, but only to a point. If a customer sends rotten stuff, it should just be rejected with a 40x code.
At minimum, check to make sure it is proper JSON... I know that a lot of stream processors will put it into a queue and 200 right away and then process in the background, but I don't think that ensuring it is at least JSON and doesn't have a content size of more than X could be too intensive.
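A rough sketch of that kind of cheap gate, in Python. The function name `precheck`, the size limit and the choice of status codes (202 for "queued", 413 for too large) are my assumptions, not anything from a particular framework.

```python
import json

MAX_BODY_BYTES = 1_000_000  # hypothetical limit, tune for your service


def precheck(body: bytes, max_bytes: int = MAX_BODY_BYTES) -> int:
    """Return the HTTP status to answer with before queuing the payload."""
    if len(body) > max_bytes:
        return 413  # Payload Too Large, rejected without parsing
    try:
        # json.loads accepts bytes and raises ValueError subclasses
        # (JSONDecodeError, UnicodeDecodeError) on bad input.
        json.loads(body)
    except ValueError:
        return 400  # not JSON at all, reject up front
    return 202  # accepted for background processing
```

The size check runs first so that an oversized body is never handed to the parser.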
In this case, if the data was already accepted and you've got no choice but to deal with it, you've gotta do what you got to do. I've been there, and it ain't fun cleaning up a 900 GB JSON file.
> Maybe I'm old-fashioned - I'm all for flexible APIs and all, but only to a point. If a customer sends rotten stuff, it should just be rejected with a 40x code.
In some fields, that's not an option. I do NMS engineering. If I need to set up monitoring for something, and the only source of the diagnostics I need is an endpoint that returns malformed JSON, I can't just throw my hands up and say "the data's in a shit format, I won't touch it". I'll have no choice but to get my hands dirty and parse out whatever I can because our systems need to be monitored.
I'm lucky in that the only times I had to deal with malformed JSON at this job, I was able to fix the program that was generating it because it was maintained by my team (the problem was that it was snarfing data from a database and sending it out as JSON but forgetting to escape tab characters, and unescaped tabs aren't allowed in JSON), but my luck's gonna run out some day.
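For anyone who hasn't hit the tab problem: a small Python illustration (the `note` field is invented). The standard `json` module rejects a raw tab inside a string by default, accepts it with `strict=False`, and `json.dumps` writes it out correctly escaped.

```python
import json

# The "\t" in this Python literal is a real tab character inside the JSON string.
raw = '{"note": "col1\tcol2"}'

try:
    json.loads(raw)
    accepted = True
except json.JSONDecodeError:
    # Raw control characters are not allowed inside JSON strings.
    accepted = False

# Consumer-side workaround: strict=False tolerates control characters.
lenient = json.loads(raw, strict=False)

# Producer-side fix: a real serializer emits the two-character escape \t.
encoded = json.dumps({"note": "col1\tcol2"})
```

Fixing the producer is the better option when you control it; `strict=False` is for when you don't.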
I got one for you. We have to upload JSON files containing, for a bunch of articles, some encoded rules and the legal text from the law that explains why the encoded rules are what they are.
The law part was supposed to be a few lines of text. Except when they don't know which article to give. In that case they provide the full law text, including scanned PDFs, base64 encoded. All 2GB of it. Basically you have something with the meaning null, encoded in a huge string.
Now the creation of this file was given to a third party, who don't bother with finding out the relevant law, and paste the 2GB blob into every article they modify, just to be sure. At this point we have 500 000 articles in that file. We get a new one every month.
Not fun at all. But it is modern, at least; in the past it was a COBOL flat file.
This reminds me of the "Parsing JSON is a minefield" post from a few weeks ago. TL;DR: JSON is not standardized (or rather, it has multiple standards), which makes parsing and validating JSON data very tricky in edge cases.
Read the original article that I linked in the HN discussion, skip through to the section where it says "Yet JSON is defined in at least six different documents". You're welcome.
How well do you know the sender? Because this looks like an attack, or at least a probe: something to try and crash the parser and see what response they get back, to see if you are vulnerable to some kind of heap corruption attack.
Malformed data is a scalability problem. Unusual failure modes from coding problems to random bit flips become inevitable as the data volume approaches infinity.
Agreed, but the payload from the article doesn't seem to have suffered from astral radiation. Rather, random attempts at quote-escaping by someone who doesn't understand what they're doing. Also notice the "nan" value -- JSON has no concept of NaN.
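On the NaN point, a quick Python illustration. Python's `json` module is itself lenient here: by default it accepts the non-standard tokens `NaN`, `Infinity` and `-Infinity` (though not lowercase `nan`). The helper names `reject_constant`, `strict_loads` and `rejects` are mine; `parse_constant` is the real hook.

```python
import json
import math


def reject_constant(name):
    # Called by the decoder for 'NaN', 'Infinity' and '-Infinity'.
    raise ValueError(f"{name} is not valid JSON")


def strict_loads(text):
    """Parse JSON but refuse the non-standard float constants."""
    return json.loads(text, parse_constant=reject_constant)


def rejects(loader, text):
    """Return True if the loader raises ValueError for this text."""
    try:
        loader(text)
    except ValueError:
        return True
    return False


default_accepts_nan = math.isnan(json.loads('{"x": NaN}')["x"])
strict_rejects_nan = rejects(strict_loads, '{"x": NaN}')
lowercase_rejected = rejects(json.loads, '{"x": nan}')
```

On the output side, `json.dumps(value, allow_nan=False)` raises instead of emitting `NaN`.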
But to be clear, error correction should be done at a level far lower than the parsing stage. It's usually a property of the storage medium or the firmware that accesses it.
If you have to correct for bit flips when you begin to read or parse data, it's too late.
my first reaction would be to parse until you hit a problem. then use a string distance function and a genetic algorithm to find the problematic characters.
in other words, find multiple possibilities that result in a valid json object and choose the one with the shortest distance.
then, of course log out the changes.
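a much simpler stand-in for that idea, sketched in Python: no genetic algorithm and no real distance function, just every single-character edit at the position where the parser gave up (so every candidate has edit distance 1). the candidate character list and the function name are made up, and this only handles one error at a time.

```python
import json

# Hypothetical set of characters worth trying as an insertion.
CANDIDATES = [",", '"', ":", "}", "]"]


def single_edit_repairs(text):
    """Return (description, fixed_text) pairs that parse after one edit."""
    try:
        json.loads(text)
        return []  # already valid, nothing to repair
    except json.JSONDecodeError as err:
        pos = err.pos  # index where the parser hit the problem

    # One deletion and a handful of insertions, all at the error position.
    attempts = [(f"delete {text[pos:pos + 1]!r} at {pos}",
                 text[:pos] + text[pos + 1:])]
    attempts += [(f"insert {ch!r} at {pos}", text[:pos] + ch + text[pos:])
                 for ch in CANDIDATES]

    repairs = []
    for description, candidate in attempts:
        try:
            json.loads(candidate)
        except json.JSONDecodeError:
            continue
        repairs.append((description, candidate))  # description doubles as the log entry
    return repairs
```

if more than one repair comes back, the input is ambiguous, and that is the point where a human should look at it.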
I do something similar with csvs. mssql is notorious for spitting out junk inside csv files.
also, i can guess how it was created.
the code is probably in c, and a rare edge case is overwriting memory before it hits the file.
It's a neat trick, but not something I'd deploy into production. If I have to try to guess at what the customer is sending me, I'm not going to apply it to their account.
In an emergency, I might hand-edit it and make it right, but I'd absolutely insist that further files be in the correct format.
That's a legit use for it, sure. But when a non-techie sees something like this, they immediately think of all the hassle they can save a customer that is having trouble making valid JSON. "We'll just parse it for them!" They completely ignore that it's not possible to know for sure what the customer really wanted, and it's the start of a lot of headaches.
It's not just the extra compute, it's the lack of a formal specification. If different services applied this kind of ad hoc "Postel's principle", they may parse the malformed markup differently and end up introducing downstream inconsistencies.
Or even vulnerabilities. Imagine a scenario where a parser for an authentication engine reads a different value for a given key than the value the authorization logic reads.
This isn't theoretical, I've seen it with HTTP, HTML and elsewhere. Any time two pieces of software disagree on how to parse a chunk of data, especially if one of them is supposed to be doing some sort of security check, you should expect to find a vulnerability lurking.
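A small Python sketch of how two components can disagree. Duplicate keys are the easiest case to show: Python's `json` keeps the last value, and I simulate a second parser that keeps the first one with `object_pairs_hook`. The payload and both hook functions are invented for illustration.

```python
import json


def first_wins(pairs):
    """Simulate a parser that keeps the first occurrence of a key."""
    out = {}
    for key, value in pairs:
        out.setdefault(key, value)
    return out


def reject_duplicates(pairs):
    """Refuse ambiguous objects instead of silently picking a value."""
    out = {}
    for key, value in pairs:
        if key in out:
            raise ValueError(f"duplicate key: {key!r}")
        out[key] = value
    return out


payload = '{"user": "alice", "user": "admin"}'

# Python's json module keeps the last value for a repeated key.
seen_by_last_wins = json.loads(payload)["user"]
# A first-wins parser reads a different identity from the same bytes.
seen_by_first_wins = json.loads(payload, object_pairs_hook=first_wins)["user"]

try:
    json.loads(payload, object_pairs_hook=reject_duplicates)
    ambiguous_rejected = False
except ValueError:
    ambiguous_rejected = True
```

If the check runs on one reading and the action on the other, you have the mismatch described above; rejecting the ambiguous document outright avoids having to pick.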
I don't know if there's a name for this class of problem. I'd be interested to know.
Another nice feature of the built-in JSON library is that you can choose what class to instantiate with the data. The default is a dict, but if you're trying to parse Avro records (or something else that cares about field order), you can change that to an OrderedDict.
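For reference, this is roughly what that looks like with Python's `json` module, where the hook is `object_pairs_hook`. The record contents are invented.

```python
import json
from collections import OrderedDict

# Hypothetical record where field order matters to the consumer.
text = '{"id": 1, "name": "x", "created": "2017-01-01"}'

# object_pairs_hook receives the (key, value) pairs in document order.
record = json.loads(text, object_pairs_hook=OrderedDict)
field_order = list(record)
```

Plain dicts also preserve insertion order from Python 3.7 onward, so the hook matters mostly on older versions or when you want a different class altogether.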
I think nom-nom is trying to imply that if you're passing GBs of JSON around, "human readability" probably isn't a concern. Therefore you could go for an efficient binary format.
JSON isn't just about human readability, it's about being a 'good enough' standard for data exchange. What binary format would you use that people could parse as reliably as JSON?
In case your question wasn't rhetorical, I believe MessagePack is the leading schema-less binary serialization format (which does not contradict your point as it is still less ubiquitous than JSON).
Maps. [0] Geo-coordinate data can consist of tens of thousands of data points. For example, think of a two-dimensional space with a co-ordinate grid at regular intervals representing a 20km x 20km city. Then imagine creating an outline of a city road network. Each point a LAT/LON co-ordinate. Then imagine placing thematic data such as known traffic hot spots.
Wouldn't a better option be an error log? You reply to the client that "I can accept 398,500 of your 400,000 submitted records, attached are the records that do not conform to the expected template. Choose either to (1) submit only the validated records and discard the malformed ones or (2) reformat the malformed records and resubmit the entire batch"
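A minimal Python sketch of that accept/reject split, assuming one JSON record per line. The required field names and the `partition` function are hypothetical; a real "expected template" would be a proper schema.

```python
import json

REQUIRED_FIELDS = {"id", "value"}  # hypothetical template


def partition(lines):
    """Split submitted lines into accepted records and a rejection log."""
    accepted, rejected = [], []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            rejected.append((number, line, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
            rejected.append((number, line, "does not match the template"))
            continue
        accepted.append(record)
    return accepted, rejected


batch = [
    '{"id": 1, "value": 2}',
    '{"id": 2}',
    '{"id": 3, "value": ',
    '[1, 2]',
]
accepted, rejected = partition(batch)
```

The rejection log keeps the line number, the original text and a reason, so the client can decide between dropping those records and resubmitting.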
> JSON should have a nicer way of dealing with double quotes in data. That would avoid many encoding mistakes.
So you update the standard to this nicer way of dealing with double quotes, and now people forget to indicate whether they're using the nice new way or the ugly old way, or they mix the two approaches ….