Hacker News new | ask | show | jobs
by isoos 1111 days ago
Proposal for web-scrapers to self-coordinate: monitor the response latencies and rate limit yourself based on that.

(1) Limit your load / parallelism: do not use more than 4 threads, and if the site takes more than a few seconds to respond, use fewer.

(2) Limit aggregate load: after each request, sleep/wait at least the amount that the previous request took to get served.

(3) If you need more than that, ask the site owners for direct channel.

This way if multiple crawler happen to crawl the same resource-limited site, the site may have a fighting chance.