Hacker News
by pimlottc 3327 days ago
> JSON is simpler to read and write, and it’s less prone to bugs.

Less prone to bugs? How's that?

5 comments

Consider XML entity bombs. You need to explicitly tell your XML parser not to follow the spec to prevent malicious sources of XML from crashing your application. XML also has a lot of room for syntax errors, with many types of tokens and escape rules. JSON, by comparison, does not.
> XML also has a lot of room for syntax errors, with many types of tokens and escape rules. JSON, by comparison, does not.

Parsing JSON is a minefield.

Take a look at how a bunch of parsers perform with various payloads: http://seriot.ch/json/pruned_results.png Yellow and light blue boxes highlight the worst situations for applications using the specified parser.

"JSON is the de facto standard when it comes to (un)serialising and exchanging data in web and mobile programming. But how well do you really know JSON? We'll read the specifications and write test cases together. We'll test common JSON libraries against our test cases. I'll show that JSON is not the easy, idealised format as many do believe. Indeed, I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because JSON libraries rely on specifications that have evolved over time and that left many details loosely specified or not specified at all."

More details available at: http://seriot.ch/parsing_json.php
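
A few of these loosely specified corners are easy to reproduce. The sketch below uses Python's standard json module only as one example parser (the linked article covers many more), and the helper name probe is made up for illustration. Other libraries may well behave differently on the same inputs, which is the article's point.

```python
import json
import math


def probe(text):
    """Parse text with the stdlib json module; return the value or the error."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return exc


# Duplicate keys: behaviour is left open by the spec; this parser keeps the last one.
dupes = probe('{"a": 1, "a": 2}')

# NaN is not valid JSON, but this parser accepts it by default.
nan_value = probe('NaN')

# Trailing commas are rejected here, though lenient parsers may accept them.
trailing = probe('[1, 2,]')

# Integers beyond 2**53 survive in Python but cannot be stored exactly
# by parsers that keep every number as an IEEE 754 double.
big = probe('12345678901234567890')

print(dupes, nan_value, type(trailing).__name__, big)
```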

None of these issues are as bad as the XML ones. You generally don't need "defusedjson" like you need https://pypi.python.org/pypi/defusedxml

<!DOCTYPE external [ <!ENTITY ee SYSTEM "file:///etc/ssh/ssh_host_ed25519_key"> ]> <root>&ee;</root>

Parser correctness is irrelevant when you're talking about the ability to be written with few syntax errors. For instance, JSON has one type of string with one set of string escape rules. XML has element names, attribute names, attribute values, text nodes, CDATA content, RCDATA content, and more. And almost all of them have different rules for what they can contain and how they can be used.

By comparison, XML is orders of magnitude more complex than JSON.
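
To make the "one set of escape rules" point concrete, here is a rough sketch with Python's standard library. The sample value is made up; json.dumps, xml.sax.saxutils.escape and xml.sax.saxutils.quoteattr are stdlib functions, and the CDATA line is plain string concatenation rather than any library call.

```python
import json
from xml.sax.saxutils import escape, quoteattr

value = 'Tom & "Jerry" <3'

# JSON: one string syntax, one set of escapes, wherever the string appears.
as_json = json.dumps(value)

# XML: the same value needs different treatment depending on where it goes.
as_text = escape(value)      # text node: & and < are replaced by entities
as_attr = quoteattr(value)   # attribute value: also picks a safe quote character
as_cdata = "<![CDATA[" + value + "]]>"  # CDATA: no escaping, but "]]>" may not appear inside

print(as_json)
print(as_text)
print(as_attr)
print(as_cdata)
```

Three XML contexts give three different serialisations of the same string, and the writer has to know which one applies.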

> XML also has a lot of room for syntax errors,

No it doesn't. XML is either well formed or not, and any parser encountering non well-formed XML will reject it outright.

Therefore all XML in use on the internet is spec-compliant.

Now try to say the same about JSON.

> any parser encountering non well-formed XML will reject it outright.

Ah, I see you're new to parsing XML.

Oh, it will be rejected alright. And then you're forced to override the parser, or to manipulate the XML before parsing it because it makes business sense to not have the source fix their XML for some reason.

People and machines are just utterly incapable of outputting valid XML.

JSON parsers have a much smaller 'feature surface' meaning that there are fewer nooks and crannies for bugs to live in.

One example of a bug that often festered in XML parsers: https://en.wikipedia.org/wiki/Billion_laughs (there is no JSON equivalent of this)

The generalized theory, for those interested: https://en.wikipedia.org/wiki/Rule_of_least_power
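
For anyone who hasn't seen how the entity expansion works, here is a deliberately tiny, harmless version, assuming Python's stdlib xml.etree.ElementTree as the parser. The document and the helper expanded_length are made up for illustration. The real attack uses the same nesting trick with ten references per level, and recent Expat versions cap large expansions.

```python
import xml.etree.ElementTree as ET


def expanded_length(base, refs_per_level, levels):
    """Size of the text when each entity references the previous one refs_per_level times."""
    return len(base) * refs_per_level ** levels


# Three levels of doubling: 2**3 = 8 copies of "ha".
TINY_BOMB = """<!DOCTYPE laughs [
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;">
  <!ENTITY c "&b;&b;">
  <!ENTITY d "&c;&c;">
]>
<laughs>&d;</laughs>"""

# The parser expands the nested internal entities while reading the document.
text = ET.fromstring(TINY_BOMB).text

# The classic payload: "lol" with 10 references per level over 9 levels.
classic_size = expanded_length("lol", 10, 9)

print(len(text), classic_size)
```

A few hundred bytes of input turn into roughly 3 GB of text in the classic case. JSON has no construct that lets a document refer to and multiply its own content like this.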

While I'd agree that parsing JSON is much easier than XML, it is still not completely trivial as demonstrated by this article: http://seriot.ch/parsing_json.php
From what I grok, the guy requesting JSON-LD wants this functionality
Probably this part:

> simpler to read and write

If you're writing these things by hand, you're probably doing something wrong...
Deserializing somebody else's XML to some usable internal data structures generally requires writing serialization/deserialization by hand and it is always a pain in the ass. On the other hand, JSON basic structures map to reasonable internal representations, so I often can simply iterate through the structures coming as-is from the parser library.

I mean, if the same webservice is offering the same data in both XML and JSON format, chances are I'd have to write less code for handling the JSON endpoint. For a client written in e.g. Java both cases may be pretty much equal, but for dynamic languages like JavaScript or Python, the difference is significant.
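
A small sketch of that difference in Python, assuming a made-up payload offered in both formats (the field names and documents below are hypothetical, not from any real service):

```python
import json
import xml.etree.ElementTree as ET

JSON_DOC = '{"title": "Example", "tags": ["a", "b"], "views": 42}'
XML_DOC = (
    "<post><title>Example</title>"
    "<tags><tag>a</tag><tag>b</tag></tags>"
    "<views>42</views></post>"
)

# JSON: the parser output is already dicts, lists, strings and numbers.
post_from_json = json.loads(JSON_DOC)

# XML: the tree has to be walked and converted by hand, including types.
root = ET.fromstring(XML_DOC)
post_from_xml = {
    "title": root.findtext("title"),
    "tags": [tag.text for tag in root.iter("tag")],
    "views": int(root.findtext("views")),
}

print(post_from_json == post_from_xml)
```

The XML half needs to know which elements repeat and which values are numbers. The JSON half gets that from the format itself.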

This is a straw man, IMO. Obviously, in production, the actual JSONs will interact very little with humans. But there's still development, debugging, etc.

So you will need to write small cases during development, tweak existing cases, etc.

Also, many tools accept configuration in JSON, which is somewhat convenient to write by hand, and is easily machine readable. Sublime Text comes to mind, for example.

JSON is also easier for computers to read and write...
XML generators and parsers have been in use for a decade+. Pretty sure most of the bugs have been found and fixed by now.

It's just reinventing the wheel because the new generation don't want to use the same tools the previous generation did. The time and effort spent doing this is quite ridiculous.

(FWIW, I hate XML, JSON is far better. But there are more important things to work on).

> Pretty sure most of the bugs have been found and fixed by now.

Given the complexity and what I've seen from some other long established codebases, I don't share your confidence.

> It's just reinventing the wheel because the new generation don't want to use the same tools the previous generation did.

You can disagree with the decisions involved (as you did with the XML vulnerability argument), but the fact that those arguments exist means they AREN'T doing it just because they don't want to use the same tools the previous generation did - they have different reasons that you think aren't good reasons.

Saying it as you did comes across as smug and dismissive, which is not an effective way of convincing your audience that you've taken arguments into account when making your decision.

RSS is sometimes ambiguous and there's a lot of variation. It can be hard to parse correctly. Not sure about Atom, though.
> RSS is sometimes ambiguous and there's a lot of variation.

I've written a reasonably-popular podcast feed validator, and I don't understand either of these criticisms. Mind elaborating?

Not the parent but my company consumed a bit of RSS starting in 2005 (and with the amounts declining to 0 through the years).

Over time we've been fed feeds with character encodings matching neither what the web server nor the XML declared. Use of undeclared XML namespaces, or, quite popular, using elements from other namespaces without namespaces or declarations -- just shove some nice iTunes things or Atom things into the RSS. Also invalid XML -- just skipping the closing tags was popular.

These feeds were from paying customers, and we were not the primary consumers - so when we complained they would generally point to someone else who was consuming it without problem. Sometimes we'd point them at a validator, if they were a small enough customer -- but mostly we just kept working on our in house RSS feed reader that could read tag soup.

Things did massively improve over time, and by the end we were getting _mainly_ reasonably valid RSS.
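
The three kinds of breakage described above can be reproduced with a strict parser. This is only a sketch, assuming Python's stdlib xml.etree.ElementTree. The feed snippets and the helper parse_error are made up for illustration, not taken from any real customer feed.

```python
import xml.etree.ElementTree as ET


def parse_error(data):
    """Return the parser's error message, or None if the document parses."""
    try:
        ET.fromstring(data)
        return None
    except ET.ParseError as exc:
        return str(exc)


feeds = {
    # Declared as UTF-8, but the "e with acute" is a single Latin-1 byte (0xE9).
    "wrong_encoding": b'<?xml version="1.0" encoding="utf-8"?><title>caf\xe9</title>',
    # iTunes element used without declaring the itunes namespace prefix.
    "undeclared_prefix": "<rss><channel><itunes:author>x</itunes:author></channel></rss>",
    # Closing tag for <title> simply skipped.
    "missing_close": "<rss><channel><title>x</channel></rss>",
    # Well-formed counterpart, for comparison.
    "valid": (
        '<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">'
        "<channel><itunes:author>x</itunes:author></channel></rss>"
    ),
}

errors = {name: parse_error(doc) for name, doc in feeds.items()}

for name, message in errors.items():
    print(name, "->", message)
```

A strict parser rejects the first three outright, which is why a feed reader that has to accept them ends up preprocessing the input or parsing tag soup.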

I haven't been writing XML parsers, but I remember Nick Bradbury, the creator of FeedDemon, wrote about it a lot 'back in the day':

* https://nickbradbury.com/2006/09/21/fixing_funky_fe_1/

* http://nick.typepad.com/blog/2004/01/feeddemon_and_w.html

* https://en.wikipedia.org/wiki/FeedDemon

Since you've done it recently, I'm sure you know more than I do; I suspect my knowledge of it is obsolete.
> I've written a reasonably-popular podcast feed validator

Mind sharing?

Nice, very cool! Definitely an improvement on the trash legacy validators out there.
I couldn't help but take a dismissive stance toward the rest of the page after reading the first paragraph.