by mtmail 3566 days ago
There's a discussion about a 2-billion-page crawl on the front page right now: https://news.ycombinator.com/item?id=12486631

Here's the author's comment on hardware https://news.ycombinator.com/item?id=12487003 and later he says it costs 300 Euro/month to run the service.


That post is what triggered my Ask post.

The problem is the huge contrast with https://www.quora.com/How-much-would-it-cost-to-crawl-1-bill...

The contrast holds even taking into account the drop in AWS prices. Also, if you take a quick look at companies that provide such services, their prices are orders of magnitude higher than deusu's costs.

Deusu's crawl servers are located at https://www.hosteurope.de/en/Server/Root-Server/ while the website points to his home broadband ISP. Two servers at his specs would be 200 Euro/month total, with 5x more bandwidth than he currently uses. I'd say that's much cheaper than AWS. Of course crawl companies charge more: they run a business, pay system administrators, and have more backup and redundancy.
I'm not sure how he manages to crawl at this speed using so few resources.

We ran a benchmark on Nutch and couldn't really get past 10-14 M(B)ps on a $1200/month machine, even though we hired a professional to optimize the setup. The same is roughly true of Heritrix.
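
For a rough sense of what the 10-14 M(B)ps figure means in pages, here is a back-of-the-envelope sketch. The average page size of 100 KB is purely an assumption for illustration (it is not a number from this thread), and since the unit is written ambiguously, the function handles both megabits and megabytes per second. `pages_per_month` is a hypothetical helper name.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def pages_per_month(rate_mega, avg_page_kb, unit_is_bits):
    """Estimate pages fetched per month at a sustained download rate.

    rate_mega    -- sustained rate in megabits/s or megabytes/s
    avg_page_kb  -- ASSUMED average page size in kilobytes
    unit_is_bits -- True if rate_mega is in megabits/s, False for megabytes/s
    """
    bytes_per_second = rate_mega * 1_000_000 / (8 if unit_is_bits else 1)
    return bytes_per_second * SECONDS_PER_MONTH / (avg_page_kb * 1000)


if __name__ == "__main__":
    for rate in (10, 14):
        for unit_is_bits in (True, False):
            unit = "Mbit/s" if unit_is_bits else "MB/s"
            estimate = pages_per_month(rate, 100, unit_is_bits)
            print(f"{rate} {unit}: about {estimate / 1e6:.0f} million pages/month")
```

The result scales linearly with the assumed page size, so halving it doubles the estimate. Treat the output as an order-of-magnitude check, not a measurement.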

Just wondering if there is something missing in his setup, such as per-domain/IP rate limiting.
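
To make clear what I mean by domain/IP rate limiting, here is a minimal sketch of a per-host politeness delay. This is not taken from deusu's source, Nutch, or Heritrix; the class name `PerHostRateLimiter` and the 2-second delay are assumptions made up for illustration. It keys on hostname only. A real crawler would probably also key on the resolved IP, so that many domains on one server share a budget.

```python
from urllib.parse import urlsplit


class PerHostRateLimiter:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed fetch time

    def wait_time(self, url, now):
        """Return seconds to wait before fetching url, and reserve that slot.

        `now` is passed in by the caller (for example time.monotonic()),
        which keeps the class easy to test.
        """
        host = urlsplit(url).hostname or ""
        earliest = max(now, self.next_allowed.get(host, now))
        # The next request to this host must come at least min_delay later.
        self.next_allowed[host] = earliest + self.min_delay
        return earliest - now


if __name__ == "__main__":
    limiter = PerHostRateLimiter(min_delay_seconds=2.0)  # assumed delay
    for url in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(url, limiter.wait_time(url, now=100.0))
```

A limiter like this caps throughput per host, so a crawl that hits a small number of hosts will be much slower than one that ignores politeness. That is the kind of difference that could explain a large gap between benchmarks.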

You can check his source if you are curious how it works ;)