HN Mirror


	by joubert 6105 days ago
	What do you mean by the third point? "3. Fiction: XML adds value. Fact: ASCII tab-delimited files in consistent formats add value, while XML SUBTRACTS value."

2 comments

wsprague 6105 days ago

I mean that for every practicing data analyst I know, XML is a pain in the ass (parsers, XPath, etc., all to get it into a CSV that you can import), while nice ASCII text is easy to work with.

If you want metadata, a well-written narrative paragraph along with a codebook is INFINITELY better than embedding the metadata in the data.

Furthermore, a lot of supposed metadata in XML is just dross like "<column>blah</column>".

Finally, all the crap in XML drives the signal-to-noise ratio way down; if you do need something that maps to a complex data structure, use JSON or something rational. Such needs are not very common in data analysis, in contrast to web applications; data analysts use multiple tables and are usually pretty close to relational databases and SQL (even if they don't call it that).

physcab 6104 days ago

Also, it's incredibly difficult to deal with large (>100 MB) datasets in XML format. Loading that thing into RAM for an XML parser is ridiculous. Tab-delimited data is really the best format possible, as you can easily build MapReduce scripts if needed to manipulate it.
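
To make the MapReduce point concrete, here is a minimal single-process sketch in Python, not an actual Hadoop job. The function names (map_rows, reduce_sum) and the two-column sample data are made up for illustration. It shows why tab-delimited input is easy to work with: each line is a complete record, so the map step never needs more than one line in memory.

```python
import csv
import io
from collections import defaultdict


def map_rows(lines, key_col, value_col):
    """Map step: emit (key, value) pairs from tab-delimited lines, one at a time."""
    for row in csv.reader(lines, delimiter="\t"):
        yield row[key_col], float(row[value_col])


def reduce_sum(pairs):
    """Reduce step: sum the values seen for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical sample data: a state code and an amount per line.
sample = io.StringIO("CA\t10\nNY\t5\nCA\t2.5\n")
totals = reduce_sum(map_rows(sample, key_col=0, value_col=1))
print(totals)
```

In a real job the map and reduce steps would run as separate processes over splits of the file; the per-line logic would look much the same.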

llimllib 6104 days ago

I almost always write my own stream parser with regular expressions to deal with large XML files (especially very regular ones), though it should be noted that there are stream XML parsers.
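
As one example of the stream parsers mentioned above, Python's standard library has xml.etree.ElementTree.iterparse, which hands back elements as their closing tags are read instead of building the whole tree first. A minimal sketch, assuming a very regular file where each record is a flat element; the iter_records helper and the tag names are hypothetical:

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield one dict per <tag> element, clearing each element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Build the record before clearing the element's children.
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the children so memory does not grow with file size


# Hypothetical sample document standing in for a large, regular XML file.
sample = (
    b"<rows>"
    b"<row><id>1</id><name>a</name></row>"
    b"<row><id>2</id><name>b</name></row>"
    b"</rows>"
)
records = list(iter_records(io.BytesIO(sample), "row"))
print(records)
```

Note that the cleared, empty row elements still hang off the root, so for truly huge files you may also want to drop them from the root as you go.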

physcab 6104 days ago

Knowing regular expressions is an all-around good idea when doing data processing. The learning curve is steep, but it pays for itself in increased productivity.

What stream XML parsers do you use? I just get my data ready for Hadoop and let it go.

llimllib 6104 days ago

To be honest, I just kind of think I know that there are stream XML parsers? I've used cElementTree when I have small XML documents and written my own regex for larger ones. (I've only ever used cElementTree to load whole documents, never as a stream parser.)

kurtosis 6105 days ago

I can imagine some circumstances where the hierarchical structure of XML would be useful, but in just about every data processing job I've undertaken that involved XML, my first step was to get rid of the XML and convert it to something like CSV or tab-delimited ASCII.
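
That "get rid of the XML first" step can be sketched in a few lines of Python for a document small enough to load whole. This is only an illustration: the xml_to_tsv helper, the tag names, and the sample data are hypothetical, and it assumes each record is a flat element whose children become columns.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_tsv(xml_text, record_tag, columns):
    """Flatten each <record_tag> element into one tab-delimited row."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)  # header row
    for record in root.iter(record_tag):
        # A missing child element becomes an empty field.
        writer.writerow([record.findtext(col, default="") for col in columns])
    return out.getvalue()


# Hypothetical sample input.
sample = (
    "<people>"
    "<person><name>Ann</name><age>34</age></person>"
    "<person><name>Bob</name></person>"
    "</people>"
)
tsv = xml_to_tsv(sample, "person", ["name", "age"])
print(tsv)
```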

wsprague 6105 days ago

If you need hierarchies, use JSON, IMHO, or keys that reference between tables (the census PUMS data does this with persons nested within households, using two tables).
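
A rough sketch of the "keys that reference between tables" idea, in Python with two tiny tab-delimited tables. The column names (hh_id, person_id, state, age) and the values are invented for illustration and are not the actual PUMS layout; the point is only that a shared key lets persons nest within households without any hierarchical file format.

```python
import csv
import io

# Hypothetical household and person tables, linked by hh_id.
households_tsv = "hh_id\tstate\n1\tOR\n2\tWA\n"
persons_tsv = "person_id\thh_id\tage\n10\t1\t34\n11\t1\t7\n12\t2\t61\n"


def read_tsv(text):
    """Read tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_persons_to_households(persons, households):
    """Attach each person's household columns by looking up the shared key."""
    by_id = {h["hh_id"]: h for h in households}
    return [{**by_id[p["hh_id"]], **p} for p in persons]


joined = join_persons_to_households(read_tsv(persons_tsv), read_tsv(households_tsv))
for row in joined:
    print(row)
```

This is the same thing a SQL JOIN on hh_id would do once both tables are loaded into a relational database.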
