HN Mirror



	by blart 4915 days ago
	I have scraped amazon.com before, but they are pretty good at detecting bots and shutting them down. How did you get around this problem when scraping millions of pages?

1 comment

netvarun 4915 days ago

Some great advice here on crawling at scale, which has inspired our crawlers a lot: http://news.ycombinator.com/item?id=4367933

Basically it boils down to three things: 1. If the site is slow, crawl slooowly. 2. If you see non-200 HTTP status codes, stop! 3. Obey robots.txt and speed restrictions.
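To make the three rules concrete, here is a rough Python sketch using only the standard library. It is not our actual crawler; the names `polite_crawl`, `fetch`, and `base_delay` are made up for illustration. It also assumes that "crawl slowly" means waiting at least as long as the last response took, and that "stop" means ending the whole crawl on the first non-200 status.

```python
import time
from urllib import robotparser


def parse_robots(robots_txt):
    """Build a robots.txt parser from the file's text (no network access)."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def polite_crawl(urls, fetch, rp, user_agent, base_delay=1.0, sleep=time.sleep):
    """Fetch urls in order and return a list of (url, body) pairs.

    fetch(url) is a caller-supplied function returning (status, body).
    """
    # Rule 3: honour Crawl-delay when robots.txt declares one.
    delay = max(base_delay, rp.crawl_delay(user_agent) or 0.0)
    pages = []
    for url in urls:
        # Rule 3: skip anything robots.txt disallows for this agent.
        if not rp.can_fetch(user_agent, url):
            continue
        start = time.monotonic()
        status, body = fetch(url)
        elapsed = time.monotonic() - start
        # Rule 2: any non-200 status ends the crawl (assumed reading of "stop").
        if status != 200:
            break
        pages.append((url, body))
        # Rule 1: a slow site gets a longer pause (assumed heuristic).
        sleep(max(delay, elapsed))
    return pages
```

Passing `fetch` and `sleep` in as parameters keeps the politeness logic separate from the HTTP client, so it can be exercised without touching a real site.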
