| HN Mirror


	by RhodesianHunter 2935 days ago
	>I’ve once even heard that computer vision is the most stable way to scrape Amazon

	At a former employer we scraped Amazon many millions of times per day with very simple old tools that rarely needed updating.

2 comments

mxvzr 2935 days ago

Are you able to share some details? How often did you have to get new IP addresses? What about the user agent? Were the scrapers "straight to the point" like amazon2csv (i.e. make a request directly to the search page), or did they have randomized behavior (e.g. re-use sessions from time to time, click a random link on the page, start from the homepage...)? Did the scrapers ever go against amz's robots.txt directives (e.g. interacting with the cart page)? Did you ever hear from amz itself about your employer's activities on their site?

lapnitnelav 2935 days ago

There are services dedicated to scraping which can take care of proxying your requests so you don't have to worry about IP bans.

For example, Scrapinghub's Crawlera (the guys behind the Scrapy Python lib).
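
To give a rough idea of what proxying looks like from the client side, here is a minimal sketch using only the Python standard library. The proxy URLs and the helper names (`proxy_cycle`, `opener_for`, `fetch`) are made up for illustration; this is not Crawlera's API, and a real service would give you its own endpoint and credentials.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints (placeholders, not a real service).
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]


def proxy_cycle(proxies):
    """Return an iterator that yields the proxies in round-robin order, forever."""
    return itertools.cycle(proxies)


def opener_for(proxy_url):
    """Build an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, proxy_url, timeout=10):
    """Fetch url through the given proxy and return the response body as bytes."""
    with opener_for(proxy_url).open(url, timeout=timeout) as resp:
        return resp.read()
```

You would call `next()` on the iterator from `proxy_cycle(PROXIES)` before each request and pass the result to `fetch`, so consecutive requests leave from different addresses. A hosted service basically does this rotation for you behind a single endpoint.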

AznHisoka 2935 days ago

Same here. Scraping their search results page is easy if you have a bunch of IPs. No manipulation or workarounds needed (i.e. headless browsers, or ensuring your HTTP headers look like a real user's).

I have not scraped a ton of actual individual product pages though, so I can't speak to scraping those. I do remember it might have been harder.
