You probably misunderstand XML (lemire.me)
71 points by msbmsb 5692 days ago
7 comments

Thumbs down.

He taught an entire course on XML, which he calls a "great meta-example on how to deal with semi-structured data"? And his only defense of XML over JSON is... it's worked ok for some file formats?

The only point in this whole article is that XML is not well-suited for RPCs, though he fails to argue that it's well-suited for anything else.

One argument is that XML is better than JSON for use cases like XHTML, where you heavily mix tags and content. I get the feeling XML wasn't really made for this case, though, it was made for the JSON-like case. Processing XHTML with E4X (the "XML for JavaScript" standard) is painful, and XML libraries in general assume your document basically consists of a tree of tags, maybe with text nodes at the leaves.

I was expecting some argument invoking the power of DTDs and XSLT or whatever else, or the original point of XML that people overlook, and all I got was an extremely weak defense of XML from someone who taught a whole course on it.

"I get the feeling XML wasn't really made for this case, though, it was made for the JSON-like case."

Back in 1997, XML was "SGML for the Web." It was a way to pass around structured, plain-text, human-readable documents that did not require expensive, buggy, incomplete parsers.

It then got misapplied as an RPC transport encoding, and tools vendors were more than happy to start pushing specs, such as W3C Schemas, that demanded the use of tools.

It started out to be simple, but, as things happen, got hijacked. But the fault is with the misapplication, not XML itself.

If you read the annotated xml spec, it really is quality work. I don't necessarily agree with every design decision, but I think a lot of people look at the complexity of xml applications and falsely blame xml itself.

Sadly there were some sensible early formats that were left behind. XML-RPC's serialization is a bit verbose but otherwise is quite similar to JSON. Somehow that got turned into SOAP and then eventually the WS- tar pit of complexity.
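
A rough way to see the "verbose but otherwise similar to JSON" point is to serialize the same value both ways with Python's standard library. This is only a sketch: the method name `demo.store` and the shape of the JSON envelope are made up for illustration and are not taken from any spec.

```python
import json
import xmlrpc.client

# Illustrative value: a struct holding a string, an array, and an int.
value = {"name": "example", "tags": ["a", "b"], "count": 3}

# XML-RPC call body. "demo.store" is a hypothetical method name.
xml_payload = xmlrpc.client.dumps((value,), methodname="demo.store")

# A comparable JSON body. The envelope shape here is an assumption.
json_payload = json.dumps({"method": "demo.store", "params": [value]})

# Both encodings carry the same data model and round-trip cleanly.
params, method = xmlrpc.client.loads(xml_payload)

print(xml_payload)
print(json_payload)
print(len(xml_payload), "bytes of XML-RPC vs", len(json_payload), "bytes of JSON")
```

The data model (structs, arrays, strings, numbers) maps one to one; the difference is mostly the amount of markup around each value.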

Likewise XML as a configuration file language can be quite elegant, almost like a literate coding version of common .ini or .conf files. But instead of this simple flat document littered with variables, xml config files in the wild end up with deeply nested structure that contributes dubious value and makes the files far less human friendly.

XML itself, with the possible exception of namespaces and a few other features, is quite simple. I totally agree it's the applications that have gotten out of hand, particularly in areas where XML is used as structured data exchange rather than document markup.

I guess the reason why this happened was because we didn't have JSON then, so XML looked like the best available option to many people at the time (except for those few who knew about s-expressions).

Now that we have JSON, there is no longer any excuse.

If you believe his best point is that XML is not well-suited for RPCs then you've missed the point.

XML is good for exactly what it stands for: an extensible markup language. It's good for dealing with semi-structured data, especially when you have to deal with data from multiple domains.

Have you ever used SGML (other than HTML)? If so, then you'd likely agree that XML is a superior standard. But I'm guessing that you have not, because for some reason you believe that XML was created for data serialization.

DTDs and XSLT _are_ useful aspects of XML and I doubt the author is unaware of them. Rather, the author assumed too much of the readers in understanding the history of XML and the nature of semi-structured data.

I don't agree at all. Mixed content was once the primary use case for XML. XML, and SGML before it, were made for marking up documents, and documents contain mixed content. They're not databases and they're not message formats containing structured data.
The only good point for XML is there are existing tools that do things via XML. There are tools that generate ATOM and RSS for you. And there are tools that consume ATOM and RSS. So if XML is already a well-defined and followed standard for what you want to do, use XML. In all other situations use something else.
<p class="content">Then help me translate this into <span class="highlight">JSON</span> please</p>

How about that way:

{"tag": "p", "class": "content", "text": ["Then help me translate this into ", {"tag": "span", "class": "highlight", "text": "JSON"}, " please"]}

And that's just a small example where you can see all the start and end tags on one screen. Now change the example to insert a hyperlink, say, around the word "help". How easy is it to change?
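
For anyone who wants to try the translation mechanically, here is a minimal sketch using Python's `xml.etree.ElementTree`. It assumes the same shape as the JSON proposed above (attributes merged into the node, children under a `text` key), except that `text` is always a list. The helper name `element_to_node` is made up, and an attribute literally named `tag` or `text` would collide with the structural keys.

```python
import json
import xml.etree.ElementTree as ET

def element_to_node(elem):
    """Convert an element into a dict, keeping text and child elements in document order."""
    children = []
    if elem.text:
        children.append(elem.text)          # text before the first child
    for child in elem:
        children.append(element_to_node(child))
        if child.tail:
            children.append(child.tail)     # text following this child
    node = {"tag": elem.tag}
    node.update(elem.attrib)                # assumption: no attribute is called "tag" or "text"
    node["text"] = children
    return node

source = ('<p class="content">Then help me translate this into '
          '<span class="highlight">JSON</span> please</p>')
tree = element_to_node(ET.fromstring(source))
print(json.dumps(tree))
```

The conversion itself is easy; the point raised above is about editing the result by hand. Wrapping one word in a hyperlink means splitting a string into three list items in the JSON form, versus adding a pair of tags in the XML form.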

Who said anything about JSON? XML is just shit. It requires documents be well formed, which sucks when you want a secretary to deal with the documents. It's just a poor combination of strict and loose.
The difference between a format that adheres to some deterministically parsable syntax and one that doesn't is not something that I would characterise as "shit" or "not shit".
"One argument is that XML is better than JSON for use cases like XHTML, where you heavily mix tags and content."

Yes, this is true. The point of using XML is when you have data where you know the structure of some parts, but not others. This is true of most things that begin life as prose, and then have some structure added to them later. It is a point between "bag of words" information retrieval, and SQL queries, that requires a different approach.

"I get the feeling XML wasn't really made for this case, though, it was made for the JSON-like case."

No, this is false. XML is awful for the JSON-like case. What would make you think that XML was created for it?

My biggest issue with XML is "XML misunderstands Unix philosophy". You can't easily use cut, awk and grep with XML without having a million edge cases to handle. There are some tools like XMLStarlet, xsltproc or xalan. But you can't safely extract content from XML files with standard tools even if you use the XML extension for gawk.

You could argue that XML documents are complex and cannot be described using simple comma-separated values. Maybe, but so many XML documents are just there to store simple key,value data.

And now, we have "jsawk" (https://github.com/micha/jsawk) for parsing JSON under your terminal...

Actually, the problem is XML demonstrates the limits of the "UNIX philosophy". Plain text simply isn't the be-all, end-all of formats. You can't easily use cut, awk, or grep on JSON, either. The "UNIX philosophy" does poorly with trees and graphs (in the computer science sense).

That's not a bad thing. The UNIX philosophy encourages you to avoid those things if you don't need them. It's very powerful. But when you actually, factually need them, you're not going to get very far with UNIX tools. That's OK; it is neither an indictment of UNIX nor of the data. Different tools are called for.
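
To make the edge cases concrete, here is a small sketch comparing a grep-style line scan with a real parser on a key/value document. The sample document and the `entry`/`key` names are invented for illustration. The regex is deliberately naive, the way a quick grep pattern tends to be.

```python
import re
import xml.etree.ElementTree as ET

# Made-up key/value config containing three things that trip up line tools:
# a commented-out entry, a tag split across lines, and a CDATA section.
doc = """<config>
  <!-- <entry key="old">ignored</entry> -->
  <entry key="host">example.org</entry>
  <entry
      key="port">8080</entry>
  <entry key="motd"><![CDATA[a < b]]></entry>
</config>"""

# grep-style extraction: one regex applied to each line independently.
pattern = re.compile(r'<entry key="([^"]+)">([^<]*)</entry>')
by_line = {}
for line in doc.splitlines():
    match = pattern.search(line)
    if match:
        by_line[match.group(1)] = match.group(2)

# Parser-based extraction: comments are skipped, line breaks inside
# tags are irrelevant, and CDATA is unwrapped into plain text.
parsed = {e.get("key"): e.text for e in ET.fromstring(doc).iter("entry")}

print("line scan:", by_line)
print("parser:   ", parsed)
```

The line scan picks up the commented-out entry and misses both the split tag and the CDATA value; the parser returns the three live entries. The same kind of gap shows up with JSON once a value spans lines or contains escaped quotes.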

xsl and transforms are what you use to extract data from xml. xslt is the coolest thing about xml IMHO.

the only real complaint I have is that xsl, being itself xml, is pretty verbose and can be tedious to write.

xslt is beyond tedious, it is infuriating. There are few use cases for xslt that would not be better served with a procedural technique, eg. python and a parser.

also the whole "using xml to define a transformation on some other xml" thing is so overly meta as to induce a massive brain hemorrhage out of my nose and all over my desk.
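
For comparison, this is roughly what the "python and a parser" route looks like for a simple transform. It is a sketch only: the `items`/`item` input vocabulary and the function name `to_html_list` are made up, and a real stylesheet would usually do more than this.

```python
import xml.etree.ElementTree as ET

def to_html_list(xml_text):
    """Turn <item name="...">value</item> elements into an HTML <ul>."""
    root = ET.fromstring(xml_text)
    ul = ET.Element("ul")
    for item in root.iter("item"):
        li = ET.SubElement(ul, "li")
        li.text = f"{item.get('name')}: {item.text}"
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(ul, encoding="unicode")

html = to_html_list('<items><item name="a">1</item><item name="b">2</item></items>')
print(html)
```

The equivalent XSLT needs a stylesheet element, a template match, and a for-each or apply-templates, all written in XML, which is where the verbosity complaint comes from.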

Fortunately you can now use XPath for simple queries and XQuery beyond that.
Forgot about them. Both XPath and XQuery are excellent technologies and are probably the best thing about XML in my experience with it. I highly recommend everyone concerned with XML check them out if they haven't already. I never seem to see much mention of them around XML discussions.

http://www.w3schools.com/xquery/xquery_intro.asp
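
For a quick taste of XPath without installing anything, Python's `xml.etree.ElementTree` supports a limited subset of it (full XPath and XQuery need other tools). A minimal sketch, with an invented feed document:

```python
import xml.etree.ElementTree as ET

# Made-up sample data for illustration.
feed = ET.fromstring("""<feed>
  <entry category="xml"><title>First</title></entry>
  <entry category="json"><title>Second</title></entry>
  <entry category="xml"><title>Third</title></entry>
</feed>""")

# Select the title of every entry whose category attribute is "xml".
titles = [t.text for t in feed.findall("./entry[@category='xml']/title")]
print(titles)
```

The query is a one-line declarative path, which is the part of the XML toolchain that tends to get overlooked in these discussions.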

> I never seem to see much mention of them around XML discussions.

Which is a shame since I've been banging my head against a particular set of problems for a while with XML, and XQuery nicely resolves that, and I think it's easier to use than SQL for the most part.

Ironically probably the most popular application of XPath in the real world is jQuery selectors! Although I don't know whether that weakens or strengthens the case for JSON...
Glad to hear SOAP is basically done: http://blogs.msdn.com/b/interoperability/archive/2010/11/10/...

That's laws of natural selection at work.

As Tim Lister once said, if everybody's getting it wrong, there's something wrong with it.
That's a cop-out. You really have to define "everybody."

I know a guy who deployed a Java application on servers with 64MB of memory, and he did it back before the JIT compiler was any good. It was performant and got the job done. He's not unique: lots of performant Java applications were built on hardware that was tiny compared to today's hardware. But for some reasonable meaning of "everybody," everybody writes horrible bloated Java code that requires costly hardware to run.

I've used simple, practical XML web services -- in fact, we have several running at work, and when adding or changing functionality, dealing with the XML aspect is a rounding error compared to implementing the application logic. But for some reasonable meaning of "everybody," everybody writing enterprise XML web services creates overengineered, overcomplex, finicky interfaces that require ongoing error-prone tweaking of DOM or SAX code.

Sometimes when everybody's getting it wrong, that just means "it" has proved irresistible to stupid people and PHBs. It doesn't mean a sensible, tasteful engineer won't be able to use it correctly. Ditching a technology because stupid people love to misuse it may be a good fashion choice, and it may be a good way to influence hiring if you don't have more direct influence, but there's no engineering justification for it.

And don't forget that for some reasonable meaning of "everybody," everybody who has tried Lisp programming has become horribly lost and failed to accomplish anything with it. (This may be less true since Lisp is rarely taught in colleges nowadays, but it was true at some point in time.)

Dupe, and not very old at that: http://news.ycombinator.com/item?id=1916489.
I looked for a prior submission before posting - search and browse. Somehow I missed it.
<fx: thoughtful frown>

http://searchyc.com/submissions/xml?sort=by_date

First result.

Got it.
It is the author of this article who, despite claiming to have taught a course on XML, seems to misunderstand XML.

I think one of the persons who best understood XML was Erik Naggum, or at least few have explained it so eloquently:

http://harmful.cat-v.org/software/xml/s-exp_vs_XML

That Naggum email was beautiful, informative, funny and wildly digressive. Mind blown.
Which is the part that I was supposed to have misunderstood exactly?