by hcoyote 2997 days ago
I used to work at a vertically-focused web search engine and ran the operational side of the crawler.

Also missing from this discussion is a mechanism to rate limit the crawl (and to determine adequate rate limits based on your error rates).
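To make the "determine rate limits from your error rates" idea concrete, here is a minimal sketch of a per-host delay that backs off on failures and relaxes again once the recent error rate drops. The class name, the thresholds, and the doubling/0.9 factors are all assumptions for illustration, not what we actually ran.

```python
from collections import deque


class AdaptiveRateLimiter:
    """Hypothetical per-host limiter: the delay grows on errors, shrinks when healthy."""

    def __init__(self, min_delay=1.0, max_delay=60.0, window=20, error_threshold=0.1):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.error_threshold = error_threshold
        self.delay = min_delay
        # Sliding window of recent outcomes; True means the request failed.
        self.outcomes = deque(maxlen=window)

    def record(self, failed):
        """Record one request outcome and return the delay to use before the next one."""
        self.outcomes.append(bool(failed))
        error_rate = sum(self.outcomes) / len(self.outcomes)
        if failed:
            # Back off quickly: double the delay, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        elif error_rate <= self.error_threshold:
            # Recent history looks healthy: speed up gently.
            self.delay = max(self.delay * 0.9, self.min_delay)
        # Otherwise (success, but the window is still error-heavy): hold steady.
        return self.delay
```

A crawler worker would call `record(...)` after each response for a given host and sleep for the returned number of seconds before hitting that host again. What counts as a failure (timeouts, 429s, 5xx, a sudden run of 403s) is a policy choice this sketch leaves to the caller.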

You also need to detect that you've been blocked and back off, so you don't keep hammering the site you're crawling with requests. Related:

IP management is an issue here as well: lots of places simply block whole IP ranges outright once they see crawling activity. And will you be honoring robots.txt or not?
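If you do decide to honor robots.txt, Python's standard library already has a parser for it. A minimal sketch, assuming you have fetched the robots.txt body yourself; the helper name `build_robots_checker` and the `examplebot` user agent are made up for the example:

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent):
    """Parse a robots.txt body; return (allowed(url) -> bool, crawl delay or None)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)


# Hypothetical robots.txt content for illustration.
SAMPLE = "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"

allowed, crawl_delay = build_robots_checker(SAMPLE, "examplebot")
print(allowed("https://example.com/public/page"))   # permitted by the sample rules
print(allowed("https://example.com/private/page"))  # disallowed by the sample rules
print(crawl_delay)                                   # the site's requested delay, if any
```

The Crawl-delay value, when a site sets one, is a reasonable floor for whatever per-host rate limit you end up using.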

Be prepared for people to block you in new and stupid ways: we once got blocked from reaching a site's name servers, so we couldn't even do lookups against them. They blackholed our packets. So what should have been a ~500ms DNS query at each HTTP request turned into a 15s pause while the DNS request timed out ... eventually this stacked up across all threads, backing up the overall crawling infrastructure until it deadlocked.
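One way to guard against that kind of stall is to put a hard deadline on every lookup and remember recent failures, so a blackholed name server costs you one short timeout instead of a long pause per request. This is a rough sketch under assumptions, not what we ran: `BoundedResolver`, its parameters, and the thread-pool approach are hypothetical choices for the example.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout


class BoundedResolver:
    """Hypothetical DNS wrapper: hard per-lookup deadline plus a negative cache."""

    def __init__(self, timeout=2.0, negative_ttl=300.0,
                 resolve=socket.gethostbyname, max_workers=8):
        self.timeout = timeout            # seconds to wait for any one lookup
        self.negative_ttl = negative_ttl  # seconds to skip a host after a failure
        self.resolve = resolve            # injectable so it can be stubbed in tests
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.failed_until = {}            # host -> monotonic time when retry is allowed

    def lookup(self, host):
        """Return an address string, or None if the host is failing or timed out."""
        now = time.monotonic()
        expiry = self.failed_until.get(host)
        if expiry is not None and expiry > now:
            return None  # failed recently: skip instead of waiting again
        future = self.pool.submit(self.resolve, host)
        try:
            return future.result(timeout=self.timeout)
        except (LookupTimeout, OSError):
            # Timed out or the resolver raised (socket.gaierror is an OSError).
            self.failed_until[host] = now + self.negative_ttl
            return None
```

Note the limits of this: a timed-out lookup still occupies a pool thread until the OS resolver gives up, so the pool size bounds how many stuck lookups you can absorb. The negative cache is what keeps the crawl threads from piling up behind the same dead name server.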

The Wayback Machine architecture is probably a good, public implementation of a large scale crawling mechanism. This post[1] about it may be a bit dated, but it's probably still accurate.

[1] http://highscalability.com/blog/2014/5/19/a-short-on-how-the...