by outpan 3566 days ago
I'm not sure how he manages to crawl at this speed using so few resources.

We benchmarked Nutch and couldn't get past 10-14 M(B)ps on a $1200/month machine, even though we hired a professional to optimize the setup. The same is roughly true of Heritrix.

Just wondering if there is something missing in his setup, such as per-domain/IP rate limiting.
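
As a rough illustration of what per-domain rate limiting means here, a minimal Python sketch follows. The class name `HostRateLimiter`, the `min_delay_s` parameter, and the injectable `clock` are all made up for this example; this is not how his crawler, Nutch, or Heritrix implement it.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Hypothetical helper: enforce a minimum delay between requests to one host."""

    def __init__(self, min_delay_s, clock=time.monotonic):
        self.min_delay_s = min_delay_s  # assumed politeness delay, in seconds
        self.clock = clock              # injectable so the logic can be tested
        self.next_allowed = {}          # host -> earliest time of the next request

    def wait_time(self, url):
        """Reserve the next slot for the URL's host; return seconds to wait first."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        # The request starts now, or at the host's next free slot if that is later.
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay_s
        return start - now
```

A fetch loop would call `time.sleep(limiter.wait_time(url))` before each request. Limiting per IP instead would key the dictionary on the resolved address rather than the hostname. A crawler that skips this step can push far more bytes per second, which is why it matters when comparing throughput numbers.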

1 comment

You can check his source if you are curious how it works ;)