by yourabi 6473 days ago

Take a look at what is out there.

If you run a simple crawl with Heritrix (as an example), you'll notice it stores everything in 'ARC' files, which are basically archives of compressed (gzip) records with an index for accessing individual records by offset.
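To make the offset idea concrete, here is a rough sketch of the general technique, not the actual ARC format or anything Heritrix ships. Each page is compressed as its own gzip member and appended to one big file, and an index remembers where each member starts and how long it is. The function names and the `(offset, length)` index layout are made up for illustration.

```python
import gzip
import os


def append_record(archive_path, payload):
    """Compress payload as one gzip member, append it, return (offset, length)."""
    member = gzip.compress(payload)
    with open(archive_path, "ab") as f:
        # Seek explicitly so the reported offset is the true end of file.
        offset = f.seek(0, os.SEEK_END)
        f.write(member)
    return offset, len(member)


def read_record(archive_path, offset, length):
    """Seek straight to one record and decompress only that record."""
    with open(archive_path, "rb") as f:
        f.seek(offset)
        return gzip.decompress(f.read(length))
```

You would keep the index (URL to offset and length) somewhere small and fast, for example a sorted text file or a tiny key-value store, and leave the bulky page bodies in the archive files.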

I would avoid sticking everything in a database, although you could probably get away with it -- but I agree with aristus that it probably doesn't matter at this point.

Another idea: look at a static HTML dump of Wikipedia and see how they structure their tree (three-letter prefixes).
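A minimal sketch of what a three-letter prefix layout could look like if you store pages as plain files. This is my guess at the general idea, not Wikipedia's exact scheme. The `prefixed_path` helper, the `_` padding, and the assumption that `name` is already a safe filename (no slashes) are all hypothetical.

```python
from pathlib import PurePosixPath


def prefixed_path(root, name, depth=3):
    """Nest a file under one directory per leading character of its name."""
    # Non-alphanumeric characters become "_" so every level is a tidy directory name.
    letters = [c if c.isalnum() else "_" for c in name.lower()[:depth]]
    # Pad short names so every file sits at the same depth.
    letters += ["_"] * (depth - len(letters))
    return PurePosixPath(root, *letters, name)
```

The point is to keep any single directory from growing to millions of entries, which many filesystems handle poorly.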

On the flip side, having it in the DB will probably be easier to manage (one place to back up) and may make it easier to split the workload across multiple boxes. For example, three boxes could each query the DB, pull down all pages for a couple hundred domains, do their processing, and insert the results when done.
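One way to split domains across boxes, sketched under the assumption that each box knows its own number and the total count. Hashing the domain name gives a stable assignment without any coordination. `worker_for` and `partition` are hypothetical names, and the choice of SHA-1 is arbitrary; any stable hash would do.

```python
import hashlib


def worker_for(domain, num_workers):
    """Map a domain to a worker number that is stable across runs and machines."""
    digest = hashlib.sha1(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(domains, num_workers):
    """Group domains into one bucket per worker."""
    buckets = [[] for _ in range(num_workers)]
    for domain in domains:
        buckets[worker_for(domain, num_workers)].append(domain)
    return buckets
```

Each box would then fetch only the domains in its own bucket, which also keeps all requests to a given site on one machine and makes per-site rate limiting simpler.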