by jws
6426 days ago
Smells like 500GB of data. I'd keep the crawled data in filesystems on the crawling boxes. Then you can load your MySQL database, and when it fails because <<insert-unforeseeable-circumstance>> you can take another shot at loading it from your data. After you resign yourself to working with a subset of the data in MySQL, you will learn how to compute what you really want to know. Then you can write a fast processor that just scans the spooled data you have on your search machines and puts those results into the database instead of the raw data. [[edit: maybe 500GB instead of 5TB, got a little crazy on my zero key in bc]]
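A rough sketch of that "scan the spool, store only the results" processor, in case it helps. Everything here is an assumption for illustration: the spool is taken to be a flat directory of `*.html` files, the per-page summary (size and a crude link count) is a placeholder for whatever you actually want to know, and `sqlite3` stands in for MySQL so the example runs on its own. The `page_stats` table and both function names are made up.

```python
import sqlite3
from pathlib import Path


def summarize(path):
    """Reduce one spooled page to a small row (placeholder metrics)."""
    data = path.read_bytes()
    # Crude link count: occurrences of "<a " after lowercasing the bytes.
    return (path.name, len(data), data.lower().count(b"<a "))


def load_summaries(spool_dir, conn):
    """Scan the spool directory and store only the per-page summaries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_stats "
        "(name TEXT PRIMARY KEY, bytes INTEGER, links INTEGER)"
    )
    rows = [summarize(p) for p in sorted(Path(spool_dir).glob("*.html"))]
    with conn:  # one transaction; rolled back if the load fails
        # INSERT OR REPLACE makes a re-run after a failed load harmless.
        conn.executemany(
            "INSERT OR REPLACE INTO page_stats VALUES (?, ?, ?)", rows
        )
    return len(rows)
```

Since the raw pages stay on disk, rerunning `load_summaries` after a failure or after changing `summarize` just rebuilds the table.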
Also, if disk space is a concern, occasionally tar/zip up a bunch of the data. HTML is very redundant and I'd bet you could squeeze 500GB of HTML down to < 50GB, even more if you have a lot of pages from the same site.
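For the tar/zip step, something like the following would do using only Python's standard `tarfile` module. This is a sketch under assumptions: pages are assumed to sit as `*.html` files in one directory, and `archive_pages` is a made-up helper name. It returns the raw and archived sizes so you can check the ratio on your own data rather than trusting my guess.

```python
import tarfile
from pathlib import Path


def archive_pages(src_dir, archive_path):
    """Pack every *.html file in src_dir into one gzip-compressed tarball.

    Returns (raw_bytes, archive_bytes) so the compression ratio can be checked.
    """
    pages = sorted(Path(src_dir).glob("*.html"))
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for page in pages:
            # arcname drops the directory prefix inside the archive.
            tar.add(str(page), arcname=page.name)
    raw_bytes = sum(p.stat().st_size for p in pages)
    return raw_bytes, Path(archive_path).stat().st_size
```

Packing many pages into one archive matters here: the compressor can reuse boilerplate shared between pages, which is why pages from the same site tend to shrink more.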
Really, a lot of this depends on what resources you have available and how you want to process the data later on. If you are classifying pages independently of one another, then why bother pooling them into a centralized DB? Just run your classifier on each node and pool those results instead.
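To make the "classify on each node, pool the results" idea concrete, here is a minimal sketch. The classifier is a hypothetical stand-in (it only checks for a `<form` tag), and how the per-node counts travel to the pooling machine is left out. The point is that each node ships a handful of counts instead of its raw pages.

```python
from collections import Counter


def classify(html):
    """Hypothetical stand-in classifier: labels a page by one simple feature."""
    return "has_form" if "<form" in html.lower() else "no_form"


def classify_node(pages):
    """Run on each crawl node: classify local pages, keep only label counts."""
    return Counter(classify(html) for html in pages)


def pool(node_results):
    """Run centrally: merge the small per-node counts into one total."""
    total = Counter()
    for counts in node_results:
        total.update(counts)
    return total
```

This only works because each page is classified on its own. If the classifier needed to compare pages across nodes, you would be back to moving data around.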
An alternative solution is S3, which I've used for crawling storage before. It's not ideal for data processing since you have to constantly pull data over the network, but it's an easy way to get centralized storage.