by elorant 3587 days ago
In my experience the best way to crawl politely is to never use an asynchronous crawler. The vast majority of small to medium sites out there have absolutely no protection against an aggressive crawler. If you make 50 to 100 requests per second, chances are you're DDoS-ing the shit out of most sites.

As for robots.txt, the problem is that most sites don't even have one, especially e-commerce sites. They also don't have a sitemap.xml, which you'd want if you'd rather not hit every URL just to find the structure of the site. Being polite in many cases takes considerable effort.
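
For what it's worth, here is a rough sketch of handling the "no robots.txt" case with Python's standard `urllib.robotparser`. The `politeness_rules` helper, the `DEFAULT_DELAY` value, and the `mybot` user agent are made up for illustration; the idea is just to fall back to a slow default when the site gives you nothing to go on.

```python
from urllib.robotparser import RobotFileParser

# Assumed conservative fallback (seconds) for sites that publish no rules.
DEFAULT_DELAY = 5.0


def politeness_rules(robots_txt, user_agent, url):
    """Return (allowed, delay_seconds) for `url`.

    `robots_txt` is the file's text, or None/"" if the site has no robots.txt.
    """
    if not robots_txt:
        # Nothing forbids the fetch, so allow it but crawl slowly.
        return True, DEFAULT_DELAY

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # crawl_delay() returns None when no Crawl-delay applies to this agent.
    delay = parser.crawl_delay(user_agent)
    delay = float(delay) if delay is not None else DEFAULT_DELAY
    return parser.can_fetch(user_agent, url), delay


# Hypothetical robots.txt content, used only to exercise the helper.
SAMPLE = "User-agent: *\nDisallow: /cart\nCrawl-delay: 10\n"
blocked = politeness_rules(SAMPLE, "mybot", "https://example.com/cart/1")
allowed = politeness_rules(SAMPLE, "mybot", "https://example.com/products")
missing = politeness_rules(None, "mybot", "https://example.com/products")
```

Fetching the file itself is left out on purpose; a 404 on /robots.txt would simply be passed in as `None`.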

4 comments

Scrapy is asynchronous, but it provides many settings that you can use to avoid DDoS-ing a website, such as limiting the number of simultaneous requests for each domain or IP address.
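
Something like the following in a project's settings.py would be a conservative starting point. The setting names are Scrapy's; the values are only illustrative guesses, not recommendations from the Scrapy docs, and the per-IP setting may be deprecated in newer releases.

```python
# settings.py (sketch; all values are illustrative assumptions)

# Respect robots.txt when the site has one.
ROBOTSTXT_OBEY = True

# Global cap on in-flight requests across all domains.
CONCURRENT_REQUESTS = 8

# At most one request at a time to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Per-IP cap; when non-zero it is used instead of the per-domain limit.
# Check your Scrapy version, as this setting may be deprecated.
CONCURRENT_REQUESTS_PER_IP = 1

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2.0
```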

And yes, crawling politely requires a bit of effort from both ends: the crawler and the website.

I agree, at the end of the day being polite or not is on the developer and not the tool itself...
Search engine crawlers use adaptive politeness: start being very polite, and ramp up parallel fetches if the site responds quickly and has a lot of pages.
That's kind of what Scrapy's AutoThrottle extension (the AUTOTHROTTLE_* settings) does.
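
A toy version of the adaptive idea, to make it concrete. This is not Scrapy's actual algorithm or what any search engine runs: the `AdaptiveDelay` class and its numbers are invented for illustration, and it adjusts the delay between requests rather than the number of parallel fetches.

```python
class AdaptiveDelay:
    """Start slow, speed up while the site answers quickly, back off on errors."""

    def __init__(self, start=5.0, min_delay=0.5, max_delay=60.0):
        self.delay = start          # current wait between requests, in seconds
        self.min_delay = min_delay  # never go faster than this
        self.max_delay = max_delay  # never wait longer than this

    def update(self, latency, ok=True):
        """Feed in the last response's latency; returns the new delay."""
        if not ok:
            # Errors or timeouts: assume the site is struggling and double the wait.
            self.delay = min(self.max_delay, self.delay * 2)
        else:
            # Move halfway toward the observed latency, clamped to the minimum.
            target = max(self.min_delay, latency)
            self.delay = max(self.min_delay, (self.delay + target) / 2)
        return self.delay


throttle = AdaptiveDelay(start=4.0)
first = throttle.update(0.2)             # fast response: delay drops
second = throttle.update(0.2)            # still fast: drops further
after_error = throttle.update(1.0, ok=False)  # failure: delay doubles
```

In Scrapy you would not write this yourself; setting `AUTOTHROTTLE_ENABLED = True` turns on its built-in throttling.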
You can rate-limit asynchronous crawlers too.
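
For example, a minimal asyncio sketch where every task has to pass through a shared limiter before "fetching". The `RateLimiter`, `fetch`, and `crawl` names are made up here, and the fetch is a placeholder rather than a real HTTP call.

```python
import asyncio
import time


class RateLimiter:
    """Let callers through at most once per `interval` seconds, across all tasks."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time of the next allowed request

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            sleep_for = self._next_slot - now
            if sleep_for > 0:
                await asyncio.sleep(sleep_for)
            # Reserve the following slot one interval after this one.
            self._next_slot = max(now, self._next_slot) + self.interval


async def fetch(url, limiter):
    await limiter.wait()
    # Placeholder: a real crawler would issue the HTTP request here.
    return url, time.monotonic()


async def crawl(urls, interval):
    limiter = RateLimiter(interval)
    # All fetches are scheduled concurrently, but the limiter spaces them out.
    return await asyncio.gather(*(fetch(u, limiter) for u in urls))


results = asyncio.run(crawl(["a", "b", "c"], 0.05))
```

The tasks still run asynchronously; only the moment each one is allowed to start is serialized, so the target site sees a steady trickle instead of a burst.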