by Sharlin 4481 days ago
The issue is probably that 99.999% of all XML use cases don't use (or need) the verification aspect. For all of those, XML is overkill. Besides, surely it would be possible to design a verification layer on top of JSON, for instance - the fact that one does not currently exist does not mean that XML (and abuse of XML!) should not be criticized.

One of the core aspects of XML that is really important is that no typing is inferred by the structure of the file unlike JSON. JSON is by nature tied to the JavaScript type system which is sparse and inaccurate. For example, if you look at the following:

   { "name": "bob", "salary": 1e999 }
Ah crap! Deserializer blew (in most cases silently converting the number to null)

   <person>
      <name>bob</name>
      <salary>1e999</salary>
   </person>

No problem. The consumer can throw that at their big decimal deserialiser.

And the following is not acceptable as it breaks the semantics of JSON and requires a secondary deserialisation step as strings ain't numbers...

   { "name": "bob", "salary": "1e999" }
JSON is a popular format but it's awful.
I think it's refreshing to hear someone advocate XML instead of JSON, specifically because you bring up a good point.

The problem, I think, is that XML being human-readable doesn't make it a good human-writable format (I'm looking at you, Maven!). I believe this is the root reason many people hate XML, even though it has a very sweet spot in application-to-application communication.

I would even argue that XML is not even that human-readable. Take a look at this pom: https://maven.apache.org/pom.html#The_Super_POM . Even with syntax highlighting it is extremely difficult to parse visually. Compare that to nginx's custom config file format: http://wiki.nginx.org/FullExample .
If you take the brackets and the closing tags out (use meaningful whitespace) it's a hell of an improvement[1]. A format I really like (ok, it's aimed at HTML, not XML) is the slim templating language[2]. It manages to pack the same information in but is massively more readable.

[1] https://gist.github.com/opsb/9424457

[2] http://slim-lang.com/

Yeah, this is exactly where my hate towards Maven configuration comes from, but it's more a testament to XML being a bad fit for configuration files than a critique of XML itself. Java enterprise application configuration has the tendency to be very "expert-friendly", and this is where XML got its bad name from.
> Ah crap! Deserializer blew (in most cases silently converting the number to null)

Right -- the parser blew it. That many implementations do this is frustrating (and caused me so many problems that I ended up building my own validator for problems like this: http://mattfenwick.github.io/Miscue-js/).

JSON doesn't set limits on number size. From RFC 4627:

> An implementation may set limits on the range of numbers.

It's the implementation's fault if the number is silently converted to null.
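
As one data point on how much this depends on the implementation, here is a minimal sketch using Python's standard `json` module (Python is just my pick for illustration, nobody upthread mentioned it). Its default number parser is `float`, so the literal overflows to infinity rather than becoming null or raising an error:

```python
import json
import math

doc = '{"name": "bob", "salary": 1e999}'

# By default, JSON numbers with an exponent are handed to float(),
# and float("1e999") overflows to infinity instead of failing.
person = json.loads(doc)
salary = person["salary"]

print(type(salary).__name__, math.isinf(salary))
```

Other parsers may well return null, raise, or clamp the value; this only shows what one implementation does.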

I guess we need better implementations!

> JSON is a popular format but it's awful.

If you're willing to take the time to share, I'd love to hear more examples of JSON's problems. I'm collecting examples of problems, which I will then check for in my validator!

If you're looking for examples of problems, RFC7159 (http://rfc7159.net/rfc7159) is a good place to start - just search for 'interop', as suggested by [1]. A quick look at Miscue-js suggests you already check for most of them, but you might still find something new.

[1] http://www.tbray.org/ongoing/When/201x/2014/03/05/RFC7159-JS...

Your example doesn't do anything but make XML look as bad as you're saying JSON is. Think about it again - do you think your first XML example doesn't ALSO have to be deserialized twice (once into an in-memory XML tree, once into a number)? It does. Also, both examples will fail if you try to deserialize either of them into numbers...

Regardless, JSON is so much more readable that I'm very glad it's pushed XML out of the picture for the most part.

Actually no, you couldn't be more wrong.

XML can be read as a stream and at certain points like after reading an element or attribute, an object can be created on the fly or a property on an object set and the type deserialised at the same time. The types don't have to be native types either; they can be complex types or aggregate types such as any numeric abstraction or date type you desire.

See javax.xml.stream (Java) and System.Xml (CLR) for example.
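
A rough sketch of that streaming style, assuming Python's `xml.etree.ElementTree.iterparse` as a stand-in for the Java and CLR readers named above (so not those APIs themselves). The `CONVERTERS` table is a made-up name for this example; it plays the role of whatever tells the reader which type each element should become:

```python
import io
import xml.etree.ElementTree as ET
from decimal import Decimal

xml_bytes = b"<person><name>bob</name><salary>1e999</salary></person>"

# Hypothetical per-element converters, invented for this illustration.
CONVERTERS = {"name": str, "salary": Decimal}

person = {}
# iterparse reports each element once its closing tag has been read,
# so the value can be converted while the stream is still being consumed.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    convert = CONVERTERS.get(elem.tag)
    if convert is not None:
        person[elem.tag] = convert(elem.text)

print(person)
```

The salary goes from element text to a `Decimal` without ever being stored as a float.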

As for readability, some XML is bad which is probably what you've seen but there's plenty that's well designed.

XML is afflicted with piles of criticism which usually comes from poor understanding or looking at machine targeted schemas that humans don't care about.

You'd complain the same if you looked at protobufs over the wire with a hex editor.

> And the following is not acceptable as it breaks the semantics of JSON and requires a secondary deserialisation step as strings ain't numbers...

XML strings ain't numbers neither. You can throw a big decimal deserialiser (e.g. as a custom deserialization adapter) at a JSON document as well.
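
For instance, a minimal sketch with Python's standard `json` module (chosen only for illustration): its `parse_float` hook receives the raw text of each number literal, so the unquoted `1e999` can go straight to a big decimal type:

```python
import json
from decimal import Decimal

doc = '{"name": "bob", "salary": 1e999}'

# parse_float is called with the original number text, so no float
# is ever created and nothing overflows.
person = json.loads(doc, parse_float=Decimal)
salary = person["salary"]

print(repr(salary))
```

This is one library's hook; other JSON libraries expose the same idea under different names, or not at all.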

Let's break this down into two statements:

XML doesn't have strings (or types at all really)

JSON strings are strings.

There is a massive semantic difference here when it comes to parsing.

What is that massive semantic difference? If you want the number represented by 1e999 as the value for salary, at some point, something has to take "1e999", whether you call it a string or a something-with-no-type, and turn it into a number. Your deserializer has to know to do that in either case.
As follows. It's more how the abstraction works.

XML:

  ->[byte stream]->[deserializer]->[bignum]
JSON:

  ->[byte stream]->[json reader]->[string]->[deserializer]->[bignum]
The latter is, well, wrong.
Multiple JSON deserializers have that mapping integrated, eliminating those steps. See, for example, the ContextualDeserializer in Jackson.
How does the [deserializer] step in the XML example know to call into [bignum], and why can't the [json reader] in the JSON example have that knowledge in the same fashion?
The equivalent of your XML would be:

    {"name": "bob", "salary": "1e999"}
I believe that creates a string with the characters "1e999", not the number 1e999.
Same as the XML
I don't think XML does either by itself. The schema will determine which fields are parsed as strings and which are parsed as numbers.
iff you have a schema, and a parser that actually uses it. I've seen a few DTDs but the vast majority of XML documents don't have a schema or even a DTD to follow.

And the vast majority of parsers will not parse anything for you, regardless of schema definitions.

Which effectively puts you in the same place as the JSON string.
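
To make that concrete, a small sketch assuming Python's standard library parsers and no schema on either side. Both hand back plain text for the salary, and the application has to supply the type:

```python
import json
import xml.etree.ElementTree as ET
from decimal import Decimal

xml_doc = "<person><name>bob</name><salary>1e999</salary></person>"
json_doc = '{"name": "bob", "salary": "1e999"}'

# Without a schema, each parser returns the salary as a str.
xml_salary = ET.fromstring(xml_doc).findtext("salary")
json_salary = json.loads(json_doc)["salary"]

# The conversion step is the caller's job in both cases.
xml_value = Decimal(xml_salary)
json_value = Decimal(json_salary)

print(type(xml_salary).__name__, type(json_salary).__name__)
```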

Exactly.
Either the author of the serialized data realized that the numbers could overflow a float or they didn't. This is independent of serialization format.

In your contrived example, somehow, the user of JSON didn't realize the salary could overflow a float. (OTOH, he succeeded in serializing it, mysteriously.) All the while, the XML user was magically forward thinking and deserialized the value into a big decimal. Your argument simply hinges on making one programmer smarter than the other. If one knows that a value will not fit a float, the memory representation won't be a float and the serialization format won't use a float representation. It has nothing to do with JSON vs XML.

This. Types are a huge pain in JSON, particularly the lack of a good date time type. BSON fixes this, but only if you're using MongoDB and are willing to give up the "human readable" requirement outside of mongo.
JSON's semantics is that you represent numbers by their decimal representation.

In this particular case, you're giving a different representation, so of course you can pass it as a string.

His point was that this number is too large to store in a JavaScript Number variable (which is an IEEE 754 double).
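
The limits of a double are easy to poke at. A quick sketch in Python, whose `float` is an IEEE 754 double on the usual platforms (the variable names are just for this example):

```python
import sys

# Largest finite double, roughly 1.8e308.
largest_double = sys.float_info.max

# 1e999 is far beyond that range, so it becomes infinity.
overflowed = float("1e999")

# Precision runs out well before range does: 2**53 + 1 is the first
# positive integer a double cannot represent exactly.
big_int = 2**53 + 1
round_tripped = int(float(big_int))

print(largest_double, overflowed, big_int, round_tripped)
```

So there are two separate failure modes: values that are too large overflow, and large integers silently lose their low digits.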
OK, so the provided number format is not sufficient for the kind of numbers he is trying to deal with. So instead you would represent it as a string and handle the encoding/decoding of that number yourself. How is that different from the XML way where there is no provided number format to begin with, and everything is a string?
That's completely irrelevant. Grok the JSON specs and reconsider what the javascript number format has to do with it.
1e99 is valid JSON, that isn't what he is complaining about. See: http://json.org/number.gif
People seem to prefer JSON, but I don't find it any better to hand-write/hand-edit than XML. If anything it's slightly worse, because it has more syntax edge cases.
And it doesn't support the multitude of accurate numeric types that XML does implicitly. XML data is not just "strings", it's a sequence of characters. The deserializer determines what sort of type it is based on either the structure or the language's capabilities. With XML, you can define these policies. With JSON you're stuck with JavaScript being the semantic standard, and with type definitions which tie you to floats or numbers inside strings. The latter is criminal.

Edit: clarification as HN won't let me reply any more.

How so? XML by itself only supports strings; any other data types have to be derived from a schema. But you can do the same with any other format that supports strings, including JSON.
But in the design of XML this was already acknowledged.

That's why there is the distinction between well-formed and valid XML documents. Only with valid XML documents there is a schema attached that will describe these nodes with the type attribute. And because it is extendable, these types can be anything but they will be automatically validated by the parser.

JSON OTOH doesn't have this extensibility. There are a couple of predefined types but if you need to go beyond them (and this happens all the time because JSON doesn't even define a date type!) any interpretation is up to the parsing program and this can vary tremendously (again, look at the handling of dates and for example the questions on stackoverflow about them).
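
The date problem is easy to reproduce. A minimal sketch with Python's standard `json` module; the timestamp is arbitrary and the ISO 8601 string is just one common convention that both sides have to agree on, not something JSON defines:

```python
import json
from datetime import datetime, timezone

# Arbitrary example timestamp.
when = datetime(2014, 3, 8, 12, 30, tzinfo=timezone.utc)

# JSON has no date type, so the encoder refuses the value outright.
try:
    json.dumps({"created": when})
    encoded_directly = True
except TypeError:
    encoded_directly = False

# Assumed convention: ship the date as an ISO 8601 string and parse it
# back by hand on the receiving side.
doc = json.dumps({"created": when.isoformat()})
decoded = datetime.fromisoformat(json.loads(doc)["created"])

print(encoded_directly, doc, decoded == when)
```

Nothing in the document itself says that `created` is a date, which is exactly why different programs end up interpreting it differently.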

> Only with valid XML documents there is a schema attached that will describe these nodes with the type attribute.

http://json-schema.org/

What's the issue?

It's still a draft (and if I may nit-pick, an expired draft).

It only has "complete structural validation". Which means it doesn't feature custom types.

Although it adds a workaround for the date issue by adding a handful of supported sub-types (http://json-schema.org/latest/json-schema-validation.html#an...)

It is far from what validation XML Schemas offer.

JSON is explicitly not designed to be hand-editable. Hence, for example, no comments.
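
A quick check of the comments point, assuming Python's standard `json` module as the parser (stricter or looser parsers exist, so treat this as one example):

```python
import json

# A JSON-like document with a JavaScript-style line comment.
commented = '{\n  // monthly salary\n  "salary": 1000\n}'

try:
    json.loads(commented)
    accepted = True
except json.JSONDecodeError:
    accepted = False

print(accepted)
```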

It's just meant to be human readable.

If you want human-editable "JSON", use YAML: http://www.yaml.org/ (it's a superset of JSON that adds comments, linking, etc.)

How is YAML a superset of JSON? Do you mean 'conceptually'?
To be specific, JSON syntax is a subset of YAML version 1.2.

However, I hate YAML with a passion. It is worse than XML in my book. I can usually read JSON fine. I can also read XML in many cases. For the life of me, I just can't read YAML. It has something to do with "-", line indentation and different ways of writing lists.

Of course, someone will say YAML is technically better ...

Same here, it is very difficult for me to tell levels of nested structures in yaml. Though I'm sure if I sat down and read up on it I could force it into my brain. But shouldn't it be intuitive to read without that?
Precisely.

Python has exactly the same problem -- control-structure nesting quickly gets confusing and hard to read beyond a certain (fairly small) size -- but at least with python, you have the option of splitting off stuff into separate functions to limit the amount of nesting and size of blocks.

It's different for me: I would prefer YAML over XML or JSON.
Do you use your naked eyes or do you have any tool recommendations? I don't see YAML going away so I'd better deal :)
It's technically true because YAML includes an alternate "inline style" that lets you write objects in JSON syntax. Therefore any JSON object is a valid YAML object as well. But, not an idiomatically written YAML object, since writing YAML using only inline style is unusual.
No, it is a superset. Every JSON document is a valid YAML document.
> because it has more syntax edge cases

Could you provide examples? I'm trying to collect more examples for a JSON validator -- http://mattfenwick.github.io/Miscue-js/ (built during a big project using JSON, after I started running into some issues that I couldn't check using other validators)

I'd love to hear more examples if you're willing to share.