by blart 4868 days ago
I have done some scraping of amazon.com previously, but they are pretty good at detecting bots and shutting them down. How did you get around this problem when scraping millions of pages?
1 comment

Some great advice here on crawling at scale, which has inspired our crawlers a lot: http://news.ycombinator.com/item?id=4367933

Basically it boils down to three things: 1. If the site is slow, crawl slooowly. 2. If you see non-200 HTTP status codes, stop! 3. Obey robots.txt and speed restrictions.
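
For anyone who wants to see those three rules in code, here is a rough Python sketch using only the standard library. It is not our actual crawler and not taken from the linked thread. The names `polite_crawl`, `next_delay`, `http_fetch` and `example-bot`, the 1 second base delay and the "wait twice as long as the response took" factor are all assumptions made up for illustration.

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser
from typing import Callable, Iterable, List, Tuple

# All names and default values below are hypothetical, chosen for illustration.
Fetch = Callable[[str], Tuple[int, bytes]]


def http_fetch(url: str) -> Tuple[int, bytes]:
    """Fetch a URL and return (status, body); HTTP errors become a status code."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def next_delay(base_delay: float, response_seconds: float, factor: float = 2.0) -> float:
    """Rule 1: the slower the site answered, the longer we wait before the next hit."""
    return max(base_delay, factor * response_seconds)


def polite_crawl(
    urls: Iterable[str],
    fetch: Fetch,
    robots: urllib.robotparser.RobotFileParser,
    user_agent: str = "example-bot",
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, bytes]]:
    """Fetch urls in order, skipping disallowed ones and stopping on any non-200."""
    pages: List[Tuple[str, bytes]] = []
    # Rule 3: if robots.txt declares a Crawl-delay, never go faster than that.
    crawl_delay = robots.crawl_delay(user_agent)
    if crawl_delay is not None:
        base_delay = max(base_delay, float(crawl_delay))
    for url in urls:
        # Rule 3: skip anything robots.txt disallows for our user agent.
        if not robots.can_fetch(user_agent, url):
            continue
        start = clock()
        status, body = fetch(url)
        elapsed = clock() - start
        # Rule 2: any non-200 status ends the crawl instead of retrying.
        if status != 200:
            break
        pages.append((url, body))
        sleep(next_delay(base_delay, elapsed))
    return pages
```

To run it against a real site you would load the rules with `robots.set_url(...)` and `robots.read()` and pass `http_fetch` as the `fetch` argument. Passing `fetch`, `sleep` and `clock` in as parameters is just a design choice so the loop can be tried out without touching the network.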