by LisaG 4686 days ago
Limited resources are the only reason. We are working on a subset crawl of ~3 million pages that will be published weekly starting two weeks from now. But doing the full crawl takes a lot of time, effort and money.

Is that really worth it though? I can crawl 3 million pages in less than 24 hours without any real effort on my part. Or are you going to provide 3 million of the most useful pages? Depth or breadth first crawl?
We do think it is worth it to avoid duplicative efforts.

Suppose you crawl 3 million pages and you pay for the compute and storage costs. Then the next person who wants crawl data goes through the same effort and pays the same costs. Doesn't it make much more sense to have a common pool of open data that everyone can use? Even if the effort and costs are low, they are not zero.

For the smaller, more frequent crawl, we are working with Mozilla and we will do the top pages (top according to Alexa).

Fair point and makes sense. If you publish the rank along with the data itself that would be very useful. Perhaps having a few sets of data? 3 million top pages, 3 million deep pages etc...

Personally I would like to see around 20-100 million pages or whatever is about 500-1000GB. That's enough data to work with on a local machine and serve up some meaningful results assuming you want to build a search engine or just do some deep analysis of the web.
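As a rough check on those numbers, here is a minimal sketch of the average page size they imply. It assumes decimal units (1 GB = 10**9 bytes) and treats the two ranges as independent bounds; the function name is made up for illustration.

```python
def avg_page_size_kb(total_gb, pages_millions):
    """Average size per page in KB, assuming 1 GB = 10**9 bytes."""
    total_bytes = total_gb * 10**9
    pages = pages_millions * 10**6
    return total_bytes / pages / 10**3


# Corners of the ranges quoted above: 500-1000 GB and 20-100 million pages.
for gb in (500, 1000):
    for millions in (20, 100):
        kb = avg_page_size_kb(gb, millions)
        print(f"{gb} GB / {millions}M pages = {kb:.0f} KB per page")
```

So the suggested corpus works out to somewhere between about 5 KB and 50 KB per page, depending on which ends of the two ranges you pair up.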

Isn't there also the additional factor that web servers sometimes allow only the major search engines to crawl? If so, and if something like this gains popularity and more apps start using it, you'd hope more web servers would allow the common crawler to crawl their sites, which they might not do if everyone were crawling individually. Just thinking aloud.
Just because you can do it without much effort doesn't mean less experienced people can. Crawling can be a barrier to some people.
To be honest, a simple crawler is a very simple thing to write. If someone had issues getting that going, I think they are going to have issues with the data volume anyway. LisaG explained the reasoning behind the 3 million page data set, though, and I agree with it.
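To make the depth vs. breadth first question from earlier in the thread concrete, here is a minimal sketch of just the frontier logic. It makes no network requests: LINKS is a made-up in-memory link graph, and crawl_order is a hypothetical helper, not anything from the actual crawler being discussed. A real crawler would also need fetching, robots.txt handling, politeness delays and deduplication.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def crawl_order(links, seed, limit, breadth_first=True):
    """Return the order in which up to `limit` pages would be visited."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < limit:
        # A queue (pop from the left) gives breadth first,
        # a stack (pop from the right) gives depth first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


print(crawl_order(LINKS, "a", 3, breadth_first=True))
print(crawl_order(LINKS, "a", 3, breadth_first=False))
```

With a small page budget the two strategies pick different pages: breadth first stays close to the seed, while depth first follows one chain of links downward before coming back.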
Could you partner with other orgs that have the same needs? Like the Internet Archive?
Internet Archive (currently) doesn't want to put their data on any cloud service. We believe it is crucial that people can easily access and analyze the data so we put it on various cloud platforms. We are talking with a few organizations about getting data donations that we could put in our corpus and make available to everyone, but nothing is settled enough that I can publicly comment on those potential partnerships yet.