Hacker News
by outpan 3566 days ago
That post is what triggered my Ask post.

The problem is the huge contrast with https://www.quora.com/How-much-would-it-cost-to-crawl-1-bill...

That holds even after taking into account the drop in AWS prices. Also, if you take a quick look at companies that provide such services, their prices are orders of magnitude higher than Deusu's costs.

1 comments

Deusu's crawl servers are hosted at https://www.hosteurope.de/en/Server/Root-Server/ while the website points to his home broadband ISP. Two servers at his specs would be 200 Euro/month total, with 5x more bandwidth than he currently uses. I'd say that's much cheaper than AWS. Of course crawl companies charge more: they run a business, pay system administrators, and have more backup and redundancy.
I'm not sure how he manages to crawl at this speed using such a small amount of resources.

We ran a benchmark on Nutch and couldn't really get past 10-14 M(B)ps on a $1200/month machine, even though we hired a professional to optimize the setup. The same is roughly true of Heritrix.
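
For a rough sense of what a sustained rate like that means in pages, here is a back-of-the-envelope sketch in Python. The 100 KB average page size is purely an assumption for illustration, not a measured value, and since "M(B)ps" could mean megabits or megabytes per second, both readings are shown.

```python
SECONDS_PER_DAY = 86_400


def pages_per_day(bytes_per_second, avg_page_bytes):
    """Pages fetched per day at a sustained download rate."""
    return bytes_per_second * SECONDS_PER_DAY / avg_page_bytes


# ASSUMPTION: average fetched page size, chosen only for illustration.
AVG_PAGE_BYTES = 100_000

# The low end of the benchmark figure, read as megabits and as megabytes.
readings = [
    ("10 Mbit/s", 10_000_000 / 8),  # bits per second converted to bytes
    ("10 MB/s", 10_000_000),
]

for label, rate in readings:
    print(f"{label}: {pages_per_day(rate, AVG_PAGE_BYTES):,.0f} pages/day")
```

The two readings differ by a factor of 8, so which unit was meant matters a lot when comparing against someone else's crawl numbers.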

I'm just wondering whether something is missing from his setup, such as per-domain/IP rate limiting.
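
For anyone unsure what per-domain rate limiting means in a crawler, here is a minimal sketch in Python. It is not taken from his source; the `HostRateLimiter` class, its `min_delay_s` parameter and the 1-second default are all hypothetical names and values picked for illustration.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, min_delay_s=1.0, clock=time.monotonic):
        self.min_delay_s = min_delay_s  # assumed politeness delay
        self.clock = clock              # injectable so it can be tested
        self._next_allowed = {}         # host -> earliest next fetch time

    def wait_time(self, url):
        """Return seconds to wait before fetching url, and reserve the slot."""
        # Keyed on hostname here; limiting per IP would mean resolving
        # the host first and keying on the address instead.
        host = urlsplit(url).hostname or ""
        now = self.clock()
        ready_at = max(self._next_allowed.get(host, now), now)
        self._next_allowed[host] = ready_at + self.min_delay_s
        return ready_at - now
```

A fetch loop would call `time.sleep(limiter.wait_time(url))` before each request. A crawler that skips this step can post much higher throughput figures, because it never waits on any single host.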

You can check his source if you are curious how it works ;)