by physcab 6059 days ago
Also, it's incredibly difficult to deal with large (>100 MB) datasets in XML format. Loading the whole thing into RAM for an XML parser is ridiculous. Tab-delimited data is really the best format possible, since you can easily build MapReduce scripts to manipulate it if needed.
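As a rough illustration of why tab-delimited data is easy to feed through map/reduce-style scripts, here is a minimal sketch in plain Python. It assumes a hypothetical two-column layout (key, then numeric value); the function names and sample data are made up for the example and are not tied to Hadoop or any particular framework.

```python
import io
from collections import defaultdict


def map_rows(lines):
    """Map step: turn each tab-delimited line into a (key, value) pair."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield fields[0], float(fields[1])


def reduce_pairs(pairs):
    """Reduce step: sum the values seen for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical sample data; a real job would read a file or stdin instead.
sample = io.StringIO("a\t1\nb\t2.5\na\t3\n")
totals = reduce_pairs(map_rows(sample))
print(totals)
```

Because each line is a complete record, the map step never needs more than one line in memory at a time.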

I almost always write my own stream parser with regular expressions to deal with large XML files (especially very regular ones), though it should be noted that there are stream XML parsers.
Knowing regular expressions is an all-around good idea when doing data processing. The learning curve is steep, but it pays for itself in increased productivity.
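A minimal sketch of what such a hand-rolled regex stream parser might look like, assuming a very regular file in which every record is a flat `<record>...</record>` element with no attributes or nesting. The tag name, chunk size, and helper names are hypothetical, and this is not a general XML parser (no entities, CDATA, or comments).

```python
import io
import re

# Assumption: records are flat <record>...</record> blocks without attributes.
RECORD_RE = re.compile(r"<record>(.*?)</record>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>([^<]*)</\1>")


def iter_records(stream, chunk_size=64 * 1024):
    """Yield one dict per record while reading the stream in chunks."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        last_end = 0
        for match in RECORD_RE.finditer(buffer):
            # Field values are returned as raw text; entities are not decoded.
            yield dict(FIELD_RE.findall(match.group(1)))
            last_end = match.end()
        # Keep only the unconsumed tail, which may hold a partial record.
        buffer = buffer[last_end:]


xml_text = (
    "<records><record><id>1</id><name>foo</name></record>\n"
    "<record><id>2</id><name>bar</name></record></records>"
)
# A tiny chunk size forces records to span chunk boundaries.
records = list(iter_records(io.StringIO(xml_text), chunk_size=7))
print(records)
```

Only the tail after the last complete record is kept between reads, so memory use depends on record size rather than file size.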

What stream XML parsers do you use? I just get my data ready for Hadoop and let it go.

To be honest, I mostly just assume that stream XML parsers exist. I've used cElementTree when I have small XML documents and written my own regex for larger ones. (cElementTree's parse() is definitely not a stream parser, since it builds the whole tree in memory.)
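For what it's worth, Python's ElementTree does ship an incremental interface, `iterparse()`, alongside the whole-tree `parse()`. The sketch below uses `xml.etree.ElementTree` (the separate `cElementTree` module was removed in Python 3.9); the `item` tag, the sample document, and the `iter_items` helper are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag="item"):
    """Yield a dict of child tag -> text for each <tag> element in source."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop the element's children and text once it has been handled.
            # The root still keeps the emptied element itself, so very long
            # files may need extra cleanup beyond this.
            elem.clear()


# Hypothetical sample document; a real call would pass a file path or handle.
xml_bytes = (
    b"<items>"
    b"<item><id>1</id><name>foo</name></item>"
    b"<item><id>2</id><name>bar</name></item>"
    b"</items>"
)
items = list(iter_items(io.BytesIO(xml_bytes)))
print(items)
```

The "end" event fires once an element is fully parsed, which is why it is safe to read its children and then clear it.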