Hacker News
by Jdam 2929 days ago
The issue with those tools is that Amazon changes the product layout very often and runs A/B tests heavily. I’ve once even heard that computer vision is the most stable way to scrape Amazon. I guess this library will stop working rather soon.
3 comments

>I’ve once even heard that computer vision is the most stable way to scrape Amazon

At a former employer we scraped Amazon many millions of times per day with very simple old tools that rarely needed updating.

Are you able to share some details? How often did you have to get new IP addresses? What about the user agent? Were the scrapers "straight to the point" like amazon2csv (i.e., make a request directly to the search page), or did they have randomized behavior (e.g., re-use sessions from time to time, click a random link on the page, start from the homepage...)? Did the scrapers ever go against amz's robots.txt directives (e.g., interacting with the cart page)? Did you ever hear from amz itself about your employer's activities on their site?
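
To make the robots.txt part of that question concrete, here is a minimal sketch of checking a URL against robots.txt rules with Python's standard `urllib.robotparser`. The rules and the `example.com` URLs are made up for illustration and are not Amazon's actual file; `build_parser` and `is_allowed` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# It is NOT a copy of any real site's rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
search_ok = is_allowed(parser, "https://www.example.com/s?k=laptop")
cart_ok = is_allowed(parser, "https://www.example.com/cart/view")
```

A scraper that wants to stay within the directives would run a check like this before each request and skip URLs for which it returns False.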
There are services dedicated to scraping that can take care of proxying your requests, so you don't have to worry about IP bans.

For example, Scrapinghub's Crawlera (the guys behind the Scrapy Python lib).
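
For a rough idea of what "proxying your requests" means if you manage the pool yourself, here is a minimal sketch using only the standard library. It is a generic round-robin rotation and not Crawlera's API; the proxy URLs, the user agent string, and the helper names (`proxy_cycle`, `build_opener_for`, `fetch`) are all assumptions made for illustration.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; a real pool would come from your provider.
PROXIES = [
    "http://proxy-a.example.com:8000",
    "http://proxy-b.example.com:8000",
]


def proxy_cycle(proxies):
    """Return an endless round-robin iterator over the proxy list."""
    return itertools.cycle(proxies)


def build_opener_for(proxy_url, user_agent="example-scraper/0.1"):
    """Build an opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch(url, rotation, timeout=10.0):
    """Fetch url through the next proxy in the rotation and return the body bytes."""
    opener = build_opener_for(next(rotation))
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Usage would look like `rotation = proxy_cycle(PROXIES)` followed by `fetch(url, rotation)` per request, so consecutive requests leave from different IPs. A hosted service does the same job behind a single endpoint.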

Same here. Scraping their search results page is easy if you have a bunch of IPs. No manipulation or workarounds needed (i.e., headless browsers, or making sure your HTTP headers look like a real user's).

I have not scraped a ton of actual individual product pages though, so I can't speak to scraping those. I do remember it might have been harder.

> I guess this library will stop working rather soon.

Don’t really see that as a dealbreaker. So the library will need maintenance. Normal for libraries to need updates in order to keep up with changes. It works today, and it will work whenever it’s updated. Better than nothing and for many use cases that’s good enough.

Search results scraping on Amazon is fairly stable.

What's more difficult is product page scraping, because there you have hundreds of different variations. Some from A/B testing and a lot just being specific things that show up for certain product categories (e.g. video games).
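
One common way to cope with that many page variations is to keep an ordered list of extractors per field and take the first one that matches. A minimal sketch, assuming regex-based extractors for brevity (a real scraper would more likely use an HTML parser); the element ids `title-a` and `title-b` are invented for illustration and are not Amazon's actual markup.

```python
import re
from typing import Callable, List, Optional

Extractor = Callable[[str], Optional[str]]


def by_pattern(pattern: str) -> Extractor:
    """Build an extractor that returns the first capture group, stripped, or None."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract


# Hypothetical layout variants, one extractor per known variant.
TITLE_EXTRACTORS: List[Extractor] = [
    by_pattern(r'<span id="title-a"[^>]*>(.*?)</span>'),
    by_pattern(r'<h1 id="title-b"[^>]*>(.*?)</h1>'),
]


def first_match(html: str, extractors: List[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        value = extract(html)
        if value:
            return value
    return None  # no known variant matched; worth logging so a new one can be added
```

When a new variant shows up (an A/B test, a category-specific layout), the fix is one more entry in the list rather than a rewrite of the parser.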