| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by LisaG 4691 days ago

We do think it is worth it to avoid duplicative efforts.

Suppose you crawl 3 million pages and you pay for the compute and storage costs. Then the next person who wants crawl data goes through the same effort and pays the same costs. Doesn't it make much more sense to have a common pool of open data that everyone can use? Even if the effort and costs are low, they are not zero.

For the smaller frequent crawl, we are working with Mozilla and we are will do the top pages (top according to Alexa).

2 comments

boyter 4691 days ago

Fair point and makes sense. If you publish the rank along with the data itself that would be very useful. Perhaps having a few sets of data? 3 million top pages, 3 million deep pages etc...

Personally I would like to see around 20-100 million pages or whatever is about 500-1000GB. That's enough data to work with on a local machine and serve up some meaningful results assuming you want to build a search engine or just do some deep analysis of the web.

link

dsinha 4689 days ago

Isn't there also the additional factor that webservers sometimes allow only the major search engines to crawl? If so, with something like this, should it gain popularity, and as more apps start using it, you'd hope more webservers allow the common crawler to crawl their websites which they might not if everyone were doing it individually...thinking aloud...

link