This is the number one reason to scrape websites. It's always nice when there's an API with documentation and rate limiting rules you can follow. Sometimes the data I need just isn't there, though. Then I open up their site and find a huge number of private API endpoints that do exactly what I want. Then I open a ticket about it and it gets 200 replies, but they ignore it for years. It's fucking stupid and it's really no wonder people scrape their site.
Why would Amazon wish to provide you with easy-to-access data on their products and prices when you could either be a competitor wishing to undercut those prices, or be a scraper company hired by such a competitor?
In what universe is providing such a straightforward way of helping a competitor considered sane business practice?
Most sellers on the Amazon platform give Amazon that information and a lot more, knowing full well that Amazon will use their sales data to launch an Amazon Basics competitor.
It is a sane business approach when you are a pragmatic business that knows the limits constraining it.
Either the content company builds a simple API (it could be just a static CSV file hosted on S3 or whatever) with useful information, or it tries to monetize or hide this information and forces scrapers to use the website.
A bot is always going to win unless you are willing to impose a lot of friction on users as well. In the era of deepfakes and fairly robust AI tooling, the difference between bot behavior and human behavior is not all that large.
If you are going to be aggressive with CAPTCHAs, IP blocks and other fingerprinting, users who are falsely flagged as bots, or who simply get annoyed, will leave.
When the cost of losing those users is higher than the cost of allowing access to scrapers, you would absolutely set up the API.
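To show how small the "static CSV file" version of an API can be, here is a minimal sketch. The field names (`sku`, `name`, `price`), the function name and the sample rows are all made up for illustration, and the upload step to S3 or any other host is deliberately left out.

```python
import csv
import io


def products_to_csv(products, fields=("sku", "name", "price")):
    """Serialize product dicts to CSV text, keeping only the public fields.

    The field names are hypothetical; a real catalog would choose its own.
    """
    buf = io.StringIO()
    # extrasaction="ignore" silently drops any keys not listed in `fields`,
    # so internal columns never leak into the published file.
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for product in products:
        writer.writerow(product)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample rows, not real data.
    sample = [
        {"sku": "A1", "name": "Widget", "price": "9.99", "internal_cost": "4.10"},
        {"sku": "B2", "name": "Gadget", "price": "19.50", "internal_cost": "8.75"},
    ]
    print(products_to_csv(sample), end="")
```

Regenerating that file on a schedule and serving it from static hosting gives scrapers something cheaper to hit than the rendered pages.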
You don't know what data they needed. Maybe they needed reviews or product descriptions. The API doesn't cover everything but it does cover the exact use case I was replying to.
Markets are competitive and efficient when all parties have full information. If Amazon doesn't want its prices to be known and finds ways to successfully prevent them from being scraped, in some sense the state should force it to disclose them via an API (or something equivalent).
Is it not viable to put the majority of your data behind a login, so that bots only get a very limited snapshot while legitimate users get the rest through a free login?
I'm asking because I'm going through a very similar situation and would love to see other opinions on this.
If you build your scraper well, and it believably impersonates a real user, you end up with a solution that can be tweaked as needed to handle whatever traps people put in place to try to defeat your scrapers.
If you build your API client well, you don't have the problems of a scraper. But if the API owner decides to change the rules for the API and you can no longer do what your business depends on (think of the API owner as Twitter), then you need to build a scraper.
Wait, why wouldn't you have rate limiting on your API? Providers like Cloudflare offer this although I guess you could roll your own too since our industry loves to reinvent the wheel.
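For the roll-your-own route, a minimal sketch of per-client rate limiting using a token bucket. This is an in-memory, single-process illustration, not how Cloudflare or any particular provider does it; the class name, the `allow_request` helper and the default limits are assumptions made up for the example.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client, keyed by API key or IP address (hypothetical helper).
_buckets = {}


def allow_request(client_id, rate=1.0, capacity=5):
    if client_id not in _buckets:
        _buckets[client_id] = TokenBucket(rate, capacity)
    return _buckets[client_id].allow()
```

A request handler would call `allow_request(api_key)` and return HTTP 429 when it comes back `False`. Anything running on more than one process would need shared storage for the buckets, which is one reason people reach for a hosted option instead.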
Does your API provide all the information that can be found on the site, or are they scraping because the API is incomplete?
We once had to scrape Amazon product pages because, although Amazon has a lot of API endpoints, none of them contained the data we needed.