I have scraped amazon.com before, but they are pretty good at detecting bots and shutting them down. How did you get around this problem when scraping millions of pages?
Basically it boils down to three things:
1. If the site is slow, crawl slooowly.
2. If you see non-200 HTTP status codes, stop!
3. Obey robots.txt and speed restrictions.
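
For what it's worth, here is a rough Python sketch of how those three rules could fit together in one loop. It is only an illustration, not the code I actually ran: `polite_crawl`, the `fetch` callback, the `example-bot` user agent, and the `min_delay` / `slowdown_factor` knobs are all names and values I made up for the example. The only library piece is the standard `urllib.robotparser` module.

```python
import time
import urllib.robotparser
from typing import Callable, Iterable, List, Tuple

# Hypothetical fetcher type: takes a URL, returns (status_code, body, elapsed_seconds).
Fetcher = Callable[[str], Tuple[int, str, float]]


def polite_crawl(
    urls: Iterable[str],
    fetch: Fetcher,
    robots: urllib.robotparser.RobotFileParser,
    user_agent: str = "example-bot",      # assumed name, pick your own
    min_delay: float = 1.0,               # assumed floor between requests, in seconds
    slowdown_factor: float = 2.0,         # assumed multiplier on the response time
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Fetch urls in order while applying the three rules; returns (url, body) pairs."""
    pages: List[Tuple[str, str]] = []

    # Rule 3: honour Crawl-delay from robots.txt if the site sets one.
    crawl_delay = robots.crawl_delay(user_agent)
    base_delay = max(min_delay, float(crawl_delay or 0))

    for url in urls:
        # Rule 3: skip anything robots.txt disallows.
        if not robots.can_fetch(user_agent, url):
            continue

        status, body, elapsed = fetch(url)

        # Rule 2: any non-200 response stops the crawl.
        if status != 200:
            break

        pages.append((url, body))

        # Rule 1: the slower the site answers, the longer we wait.
        sleep(max(base_delay, elapsed * slowdown_factor))

    return pages
```

In real use, `fetch` would wrap whatever HTTP client you prefer and time each request, and `robots` would be loaded from the site's own robots.txt (for example with `set_url()` followed by `read()`). Passing `sleep` in as a parameter just makes the loop easy to test without actually waiting.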