by LisaG
4691 days ago
We do think it is worth it to avoid duplicated effort. Suppose you crawl 3 million pages and pay the compute and storage costs. Then the next person who wants crawl data goes through the same effort and pays the same costs. Doesn't it make much more sense to have a common pool of open data that everyone can use? Even if the effort and costs are low, they are not zero. For the smaller, more frequent crawl, we are working with Mozilla and will crawl the top pages (top according to Alexa).
Personally, I would like to see around 20-100 million pages, or whatever comes to about 500-1000 GB. That's enough data to work with on a local machine and still get meaningful results, whether you want to build a search engine or just do some deep analysis of the web.
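
As a rough sanity check on those figures, here is a minimal sketch of the average page size they imply. It assumes decimal gigabytes (1 GB = 10^9 bytes) and pairs the low ends and the high ends of the two ranges; the helper name `bytes_per_page` is made up for illustration.

```python
def bytes_per_page(total_gb: float, pages: int) -> float:
    """Average bytes per page, assuming decimal gigabytes (1 GB = 10**9 bytes)."""
    return total_gb * 10**9 / pages


# Pair the low ends (500 GB, 20 million pages) and the high ends
# (1000 GB, 100 million pages) of the ranges mentioned above.
low_end = bytes_per_page(500, 20_000_000)
high_end = bytes_per_page(1000, 100_000_000)

print(f"low end:  {low_end / 1000:.0f} KB per page")
print(f"high end: {high_end / 1000:.0f} KB per page")
```

So the numbers work out to somewhere in the tens of kilobytes per page on average, depending on which ends of the ranges you combine.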