
by Aloisius 531 days ago

I think a default max of 1 request every 5 seconds is unnecessarily meek, especially for larger sites. I'd also argue that requests a browser wouldn't pause for, like following redirects within the same domain or fetching links marked for prefetch, don't really need a delay at all.
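
To make that concrete, here is a minimal sketch of a per-host limiter that skips the delay for same-host redirects. The class name, the `redirected_from` parameter and the delay value are all made up for illustration; a real crawler would persist this state and handle prefetch links too.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum gap between requests to the same host (hypothetical helper)."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay   # seconds between requests per host (assumed value)
        self.clock = clock           # injectable so the logic can be checked without sleeping
        self._next_allowed = {}      # host -> earliest time the next delayed request may go out

    def wait_time(self, url, redirected_from=None):
        """Return how many seconds to wait before fetching url."""
        host = urlsplit(url).hostname
        # A redirect that stays on the same host is treated as a continuation
        # of the previous request, so no extra delay is applied.
        if redirected_from is not None and urlsplit(redirected_from).hostname == host:
            return 0.0
        now = self.clock()
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record(self, url):
        """Note that a request to url was just sent."""
        host = urlsplit(url).hostname
        self._next_allowed[host] = self.clock() + self.min_delay
```

The caller sleeps for `wait_time(...)` seconds, sends the request, then calls `record(...)`.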

If you can detect that a site sits behind a CDN, that metrics like time-to-first-byte are low and stable, and/or that the cache-control headers indicate you're mostly getting cached pages, I see no reason not to speed up, at least for domains with millions of URLs.
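
Roughly what I have in mind, as a sketch only: every threshold below (200 ms mean TTFB, 50 ms spread, 80% cached, a million URLs, at least 10 samples) is a number I picked for illustration, and `headers` is assumed to be a plain dict with canonically cased keys.

```python
from statistics import mean, pstdev


def looks_cached(headers):
    """Guess from response headers whether a shared cache or CDN could serve the page."""
    cache_control = headers.get("Cache-Control", "").lower()
    publicly_cacheable = (
        ("max-age" in cache_control or "s-maxage" in cache_control)
        and "no-store" not in cache_control
        and "private" not in cache_control
    )
    # Shared caches add an Age header when they serve a stored response.
    return publicly_cacheable or "Age" in headers


def choose_delay(ttfb_samples, cached_fraction, url_count,
                 base_delay=5.0, fast_delay=0.5):
    """Pick a per-host delay in seconds; all thresholds are illustrative assumptions."""
    # Stay conservative for small sites or when there is too little data.
    if url_count < 1_000_000 or len(ttfb_samples) < 10:
        return base_delay
    low = mean(ttfb_samples) < 0.2        # time-to-first-byte is low
    stable = pstdev(ttfb_samples) < 0.05  # and does not vary much
    mostly_cached = cached_fraction > 0.8
    return fast_delay if (low and stable) or mostly_cached else base_delay
```

If TTFB starts climbing after speeding up, the same function drops you back to the base delay on the next evaluation.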

I disagree with using HEAD requests for refreshing. A HEAD request is rarely cheaper, and for some websites is more expensive, than a conditional GET with If-Modified-Since/If-None-Match. Besides, you're going to fetch the page anyway if it changed, so why issue two requests when you could do one?
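
A conditional GET with the standard library looks something like this. It's a sketch: the function names and the `example-crawler/0.1` user agent are placeholders, and it assumes you stored the ETag and Last-Modified values from the previous fetch.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None,
                              user_agent="example-crawler/0.1"):
    """Build a GET that asks the server to skip the body if nothing changed."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def refresh(url, etag=None, last_modified=None):
    """Return (changed, body, etag, last_modified) using a single request."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return (True, resp.read(),
                    resp.headers.get("ETag"), resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; the stored copy is still valid.
        if err.code == 304:
            return False, None, etag, last_modified
        raise
```

One round trip either way: a 304 with no body if the page is unchanged, or the new page if it isn't.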

Having a single crawler per process/thread makes rate limiting easier, but with distributed crawling it can lead to balancing and under-utilization issues, because URLs per domain and site speeds vary massively, especially if you use something like a hash to distribute domains. For Common Crawl, I had something that monitored utilization and shut down crawler instances, which redistributed the pending URLs from the machines shutting down to the machines that were left (we were doing it on a shoestring budget using AWS spot instances, so it had to survive instances going down randomly anyway).
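
To illustrate the redistribution idea (this is not the Common Crawl code, just a sketch using rendezvous hashing as a stand-in), here is one way to map domains to workers so that when a worker goes away only its domains move. Worker names and function names are hypothetical.

```python
import hashlib
from collections import Counter


def score(worker, domain):
    """Deterministic pseudo-random weight for a (worker, domain) pair."""
    digest = hashlib.sha256(f"{worker}|{domain}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign(domain, workers):
    """Rendezvous (highest-random-weight) hashing: the top-scoring worker owns the domain."""
    return max(workers, key=lambda w: score(w, domain))


def pending_load(domain_url_counts, workers):
    """Sum pending URLs per worker, to spot the imbalance hashing by domain can cause."""
    load = Counter({w: 0 for w in workers})
    for domain, n in domain_url_counts.items():
        load[assign(domain, workers)] += n
    return load
```

Since all of a domain's URLs land on one worker, a single huge domain can still dominate that worker's queue, which is why something has to watch utilization on top of the hashing.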

I'd say one of the most polite things to do when crawling is to add a URL to the crawler's user agent pointing to a page that explains what the crawler is, and maybe lets people opt out or explains how to update their robots.txt to opt out.
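
For example (a sketch, with `ExampleBot` and `https://crawler.example/about` as made-up names), the user agent can carry the info URL, and the opt-out a site owner would add to robots.txt can be checked with the standard library parser:

```python
from urllib.robotparser import RobotFileParser

BOT_NAME = "ExampleBot"                     # hypothetical crawler name
INFO_URL = "https://crawler.example/about"  # hypothetical page explaining the crawler
USER_AGENT = f"Mozilla/5.0 (compatible; {BOT_NAME}/1.0; +{INFO_URL})"

# What a site owner might publish to opt out of this crawler only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, bot_name, url):
    """Check a URL against robots.txt text; pass the bare bot name, not the full UA string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_name, url)
```

Send `USER_AGENT` in the request header, but match robots.txt rules on `BOT_NAME`, since `RobotFileParser` only looks at the part of the string before the first slash.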