Hacker News
by purerandomness 1694 days ago
Why do you think they are trying to circumvent it?

Does your API provide all the information that can be found on the site, or are they scraping because the API is incomplete?

We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

This is the number one reason to scrape websites. It's always nice when there's an API with documentation and rate limiting rules you can follow. Sometimes the data I need just isn't there, though. Then I open up their site and find a huge number of private API endpoints that do exactly what I want. Then I open up a ticket about it and it gets 200 replies but they ignore it for years. It's fucking stupid and it's really no wonder people scrape their site.
Why would Amazon wish to provide you with easy-to-access data on their products and prices when you could either be a competitor wishing to undercut those prices, or be a scraper company hired by such a competitor?

In what universe is providing such a straightforward way of helping a competitor considered sane business practice?

Most sellers who are on Amazon's platform give Amazon that information and a lot more, knowing full well Amazon will use their sales data to launch an Amazon Basics competitor.

It is a sane business approach when you are a pragmatic business that knows the limits that constrain it.

Either the content company is going to build a simple API (could be just a static CSV file hosted on S3 or whatever) with useful information, or it will try to monetize or hide this information and force scrapers to use the website.

A bot is always going to win unless you are willing to impose a lot of friction on users as well. In the era of deepfakes and fairly robust AI tooling, the difference between bot action and human action is not all that much.

If you are going to be aggressive with CAPTCHAs, IP blocks and other fingerprinting, users who get flagged as false positives or get annoyed will leave.

When the cost of losing those users is more than the cost of allowing access to scrapers, you would absolutely set up the API.

Man, your comment is hilarious because in fact Amazon DOES provide an API for exactly that.
And yet...

> We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

...only a couple of comments up.

You don't know what data they needed. Maybe they needed reviews or product descriptions. The API doesn't cover everything but it does cover the exact use case I was replying to.
Because they will get the data regardless of what you do, and if you don't make an API, it will cost you more due to overhead.
Markets are competitive and efficient when all parties have full information. If Amazon doesn't want its prices to be known and finds ways to successfully prevent them from being scraped, in some sense the state should force it to disclose them via an API (or something equivalent).
In the end, they still get the data, just in a much less desirable way for both you and the customer.