
I agree that initially you should dump it to a local filesystem. Since this is an experiment you don't want to get bogged down in DB performance details.
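For what it's worth, dumping to the filesystem can be as small as the sketch below. The `save_page` helper, the hashed-URL filename scheme, and the two-character shard directories are all my own assumptions for illustration, not something the thread prescribes.

```python
import hashlib
from pathlib import Path


def save_page(root: Path, url: str, html: bytes) -> Path:
    """Write one crawled page under root, named by a hash of its URL."""
    # Hashing the URL gives a filesystem-safe, fixed-length filename.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Shard by the first two hex characters so no single directory
    # ends up holding every file (an assumed layout, adjust to taste).
    path = root / digest[:2] / f"{digest}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(html)
    return path
```

Saving the same URL twice overwrites the earlier copy, which is probably fine for an experiment but worth knowing if you recrawl.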

Also, if HD space is a concern, occasionally tar/zip up a bunch of the data. HTML is very redundant and I'd bet you could squeeze 500GB of HTML down to < 50GB, even more if you have a lot of pages from the same site.
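If you want to check how well your own data compresses before relying on a guess, something like this would do it. It is a minimal sketch using Python's standard `tarfile` module; `archive_pages` is a made-up helper name, and the archive is assumed to live outside the directory being archived.

```python
import tarfile
from pathlib import Path


def archive_pages(src_dir: Path, archive_path: Path) -> float:
    """Gzip-tar src_dir and return compressed size / raw size."""
    # Total size of the uncompressed files, for comparison.
    raw_bytes = sum(p.stat().st_size for p in src_dir.rglob("*") if p.is_file())
    # "w:gz" writes a gzip-compressed tar archive.
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(src_dir), arcname=src_dir.name)
    return archive_path.stat().st_size / raw_bytes if raw_bytes else 0.0
```

Only delete the originals after confirming the archive opens cleanly. The ratio you get will depend heavily on how similar the pages are to each other.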

Really, a lot of this depends on what resources you have available and how you want to process the data later on. If you are classifying pages independently of one another then why bother pooling them to a centralized DB? Just run your classifier on each node and pool those results instead.
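Roughly, the per-node approach could look like the following. This is only a sketch under my own assumptions: each node writes one JSON-lines result file, `classify` is whatever function you already have (here just a callable from HTML text to a label), and pooling means counting labels.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def classify_node(page_dir: Path, classify: Callable[[str], str], out_path: Path) -> None:
    """Run on each crawl node: classify local pages, write small result records."""
    with out_path.open("w", encoding="utf-8") as out:
        for page in sorted(page_dir.rglob("*.html")):
            html = page.read_text(encoding="utf-8", errors="replace")
            # One JSON object per line: tiny compared with the page itself.
            out.write(json.dumps({"page": page.name, "label": classify(html)}) + "\n")


def pool_results(result_files: Iterable[Path]) -> Counter:
    """Run centrally: merge the per-node result files into label counts."""
    counts: Counter = Counter()
    for result_file in result_files:
        with result_file.open(encoding="utf-8") as fh:
            for line in fh:
                counts[json.loads(line)["label"]] += 1
    return counts
```

The only thing that ever crosses the network is the result file from each node, which is the whole point.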

An alternative solution is S3, which I've used for crawling storage before. It's not ideal for data processing since you have to constantly pull data over the network, but it's an easy way to get centralized storage.
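A minimal sketch of the S3 route, assuming the boto3 library and a bucket you have already created. The helper names and the `pages/<hash>.html` key scheme are hypothetical; `put_object` and `get_object` are the standard boto3 S3 client calls. The client is passed in so the helpers don't care how it was configured.

```python
import hashlib


def page_key(url: str) -> str:
    """Derive an object key from the URL (assumed naming scheme)."""
    return "pages/" + hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"


def upload_page(s3, bucket: str, url: str, html: bytes) -> str:
    """Store one page in the bucket and return its key."""
    key = page_key(url)
    s3.put_object(Bucket=bucket, Key=key, Body=html)
    return key


def download_page(s3, bucket: str, key: str) -> bytes:
    """Fetch a page back; every call is a network round trip."""
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   s3 = boto3.client("s3")
#   key = upload_page(s3, "my-crawl-bucket", "http://example.com/", b"<html>...</html>")
```

Each `download_page` call pulls the full object over the network, which is the processing cost mentioned above.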