Hacker News
by joubert 6059 days ago
What do you mean by the third point?

3. Fiction: XML adds value. Fact: ASCII tab-delimited files in consistent formats add value, while XML SUBTRACTS value.


I mean that for all the practicing data analysts I know, XML is a pain in the ass (parsers, XPath, etc., all just to get it into a CSV that you can import), while nice ASCII text is easy to work with.

If you want metadata, a well written narrative paragraph along with a code book is INFINITELY better than embedding the metadata in the data.

Furthermore, a lot of supposed metadata in XML is just dross like "<column>blah</column>".

Finally, all the crap in XML drives the signal-to-noise ratio way down; if you do need something that maps to a complex data structure, use JSON or something rational. Such needs are not very common in data analysis, in contrast to web applications; data analysts use multiple tables and are usually pretty close to relational databases and SQL (even if they don't call it that).

Also, it's incredibly difficult to deal with large (>100 MB) datasets in XML format. Loading the whole thing into RAM just so an XML parser can build its tree is ridiculous. Tab-delimited data is really the best format possible, as you can process it one line at a time and easily build MapReduce scripts to manipulate it if needed.
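To make the MapReduce point concrete, here is a minimal sketch of a map step and a reduce step over tab-delimited lines. The column positions and sample rows are made up for illustration, and `sorted()` only stands in for the shuffle/sort that a framework like Hadoop would do between the two steps.

```python
from itertools import groupby


def mapper(lines, key_col, value_col):
    """Emit (key, value) pairs from tab-delimited lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[key_col], float(fields[value_col])


def reducer(pairs):
    """Sum the values for each key; input is sorted by key first."""
    # sorted() stands in for the framework's sort between map and reduce.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(value for _key, value in group)


# Hypothetical rows: state, age, weight.
lines = ["NY\t34\t1.0\n", "CA\t71\t2.5\n", "NY\t36\t0.5\n"]
totals = dict(reducer(mapper(lines, key_col=0, value_col=2)))
```

Each line is a complete record, so the input can be split at any newline and handed to separate workers, which is the property that is hard to get from one big XML document.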
I almost always write my own stream parser with regular expressions to deal with large XML files (especially very regular ones), though it should be noted that there are stream XML parsers.
Knowing regular expressions is an all-around good idea when doing data processing. The learning curve is steep, but it pays for itself in increased productivity.
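A rough sketch of the regex approach for a very regular XML file, not a general XML parser. It assumes a hypothetical layout with one `<row>` element per line, flat child elements, and no attributes, entities, or CDATA; the tag and column names are invented for the example.

```python
import io
import re

# Hypothetical layout: one <row>...</row> per line with flat child elements.
ROW_RE = re.compile(r"<row>(.*?)</row>")
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>")


def xml_rows_to_tsv(lines, columns):
    """Yield one tab-delimited line per <row>, reading the input line by line."""
    for line in lines:
        match = ROW_RE.search(line)
        if match is None:
            continue  # skip the XML declaration, wrapper tags, and blank lines
        fields = dict(FIELD_RE.findall(match.group(1)))
        yield "\t".join(fields.get(col, "") for col in columns)


sample = io.StringIO(
    "<?xml version='1.0'?>\n"
    "<rows>\n"
    "<row><id>1</id><name>ann</name><score>3.5</score></row>\n"
    "<row><id>2</id><name>bob</name><score>4.0</score></row>\n"
    "</rows>\n"
)
tsv_lines = list(xml_rows_to_tsv(sample, ["id", "name", "score"]))
```

Because it only ever holds one line in memory, this works on files of any size, but it will silently produce wrong output as soon as the file stops being that regular (rows spanning lines, nested elements, escaped characters).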

What stream XML parsers do you use? I just get my data ready for Hadoop and let it go.

To be honest, I only kind of know that stream XML parsers exist; I haven't really used one. I've used cElementTree when I have small XML documents and written my own regex for larger ones. (The way I use cElementTree, it parses the whole document into a tree in memory, so that is definitely not stream parsing.)
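For what it's worth, Python's standard library does have an incremental interface, `xml.etree.ElementTree.iterparse`, which hands back elements as they are completed instead of after the whole file is read. A minimal sketch, assuming a flat, hypothetical `<row>` layout (the tag and column names are made up):

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source, row_tag, columns):
    """Yield one tab-delimited line per row_tag element as it is parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield "\t".join(elem.findtext(col, default="") for col in columns)
            elem.clear()  # drop the row's children once the row is emitted


sample = io.BytesIO(
    b"<rows>"
    b"<row><id>1</id><name>ann</name></row>"
    b"<row><id>2</id><name>bob</name></row>"
    b"</rows>"
)
rows = list(stream_rows(sample, "row", ["id", "name"]))
```

Note that `elem.clear()` empties each row but the root element still keeps a reference to the emptied rows, so memory use grows slowly on very large files unless those are removed from the root as well.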
I can imagine some circumstances where the hierarchical structure of XML would be useful, but in just about every data processing job I've undertaken that involved XML, my first step was to get rid of the XML and convert it to something like CSV or ASCII tab-delimited text.
If you need hierarchies, use JSON, IMHO, or keys that reference between tables (the census PUMS data does this with persons nested within households, using two tables).
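A small sketch of the two-table idea: nested household records are split into a household table and a person table that share a key. The field names (`hh_id`, `person_no`, and so on) and the sample records are invented for illustration and are not the actual PUMS layout.

```python
import csv
import io

# Made-up nested records, standing in for a hierarchical JSON or XML export.
households = [
    {"hh_id": 1, "state": "NY", "persons": [{"age": 34}, {"age": 36}]},
    {"hh_id": 2, "state": "CA", "persons": [{"age": 71}]},
]


def split_tables(records):
    """Flatten nested households into two row lists linked by hh_id."""
    hh_rows, person_rows = [], []
    for hh in records:
        hh_rows.append([hh["hh_id"], hh["state"]])
        for person_no, person in enumerate(hh["persons"], start=1):
            person_rows.append([hh["hh_id"], person_no, person["age"]])
    return hh_rows, person_rows


def to_tsv(header, rows):
    """Render a header plus rows as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


hh_rows, person_rows = split_tables(households)
household_tsv = to_tsv(["hh_id", "state"], hh_rows)
person_tsv = to_tsv(["hh_id", "person_no", "age"], person_rows)
```

Both outputs are plain tab-delimited tables, and the hierarchy is recovered with an ordinary join on `hh_id`.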