Hacker News
by ebbp 1687 days ago
We do offer an API - the scrapers are trying to circumvent using that, presumably.

Why do you think they are trying to circumvent it?

Does your API provide all the information that can be found on the site, or are they scraping because the API is incomplete?

We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

This is the number one reason to scrape websites. It's always nice when there's an API with documentation and rate limiting rules you can follow. Sometimes the data I need just isn't there, though. Then I open up their site and find a huge number of private API endpoints that do exactly what I want. Then I open up a ticket about it and it gets 200 replies, but they ignore it for years. It's fucking stupid and it's really no wonder people scrape their site.
Why would Amazon wish to provide you with easy to access data on their products and prices when you could either be a competitor wishing to undercut those prices, or be a scraper company hired by such a competitor?

In what universe is providing such a straightforward way of helping a competitor considered sane business practice?

Most sellers who are on Amazon's platform give Amazon that information and a lot more, knowing full well Amazon will use their sales data to launch an Amazon Basics competitor.

It is a sane business approach when you are a pragmatic business who knows the limits that constrain your business.

Either the content company builds a simple API (it could be just a static CSV file hosted on S3 or whatever) with useful information, or it tries to monetize or hide this information and forces scrapers to use the website.
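
To make the "static CSV file" idea concrete, here is a minimal sketch of the export step. The column names (`sku`, `name`, `price`) and the sample records are hypothetical, not anyone's real schema, and the upload to S3 or any other host is deliberately left out.

```python
import csv
import io

# Hypothetical columns; a real export would use whatever fields the site exposes.
FIELDNAMES = ["sku", "name", "price"]

def products_to_csv(products):
    """Render a list of product dicts as CSV text for a static file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for product in products:
        # Write only the published columns, so internal fields never leak.
        writer.writerow({key: product[key] for key in FIELDNAMES})
    return buf.getvalue()

# Made-up sample data for illustration only.
sample = [
    {"sku": "A1", "name": "Widget", "price": "9.99", "internal_cost": "4.00"},
    {"sku": "B2", "name": "Gadget, large", "price": "19.50", "internal_cost": "8.00"},
]
csv_text = products_to_csv(sample)
```

Regenerating a file like this on a schedule and serving it from static hosting is probably much cheaper than serving the same data page by page to a scraper.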

A bot is always going to win unless you are willing to impose a lot of friction on users too. In the era of deepfakes and fairly robust AI tooling, the difference between bot action and human action is not all that much.

If you are going to be aggressive with CAPTCHAs, IP blocks, and other fingerprinting, users who are falsely flagged as bots, or who simply get annoyed, will leave.

When the cost of losing those users is more than the cost of allowing access to scrapers, you would absolutely set up the API.
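
As a rough sketch of that trade-off, the comparison can be written as a one-line check. The function name, its parameters, and the figures in the example are all hypothetical, chosen only to show the arithmetic.

```python
def api_is_worth_it(users_lost_per_month, value_per_user, scraper_cost_per_month):
    """Return True when driving users away costs more than tolerating scrapers.

    All three inputs are estimates the site owner would have to supply.
    """
    cost_of_lost_users = users_lost_per_month * value_per_user
    return cost_of_lost_users > scraper_cost_per_month

# Made-up figures: 200 lost users at 5.0 each against 800.0 of scraper overhead.
decision = api_is_worth_it(200, 5.0, 800.0)
```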

Man, your comment is hilarious, because in fact Amazon DOES provide an API for exactly that.
And yet...

> We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

...only a couple of comments up.

You don't know what data they needed. Maybe they needed reviews or product descriptions. The API doesn't cover everything but it does cover the exact use case I was replying to.
Because they will get the data regardless of what you do and if you don't make an API it will cost you more due to overhead.
Markets are competitive and efficient when all parties have full information. If Amazon doesn't want its prices to be known and finds ways to successfully prevent them from being scraped, in some sense the state should force it to disclose them via an API (or something equivalent).
In the end, they still get the data, just in a much less desirable way for both you and the customer.
Is it not viable to put the majority of your data behind a login, so that bots only get a very limited snapshot while legitimate users get the rest through a free login?

I’m asking this because I’m going through a very similar situation and would love to see other opinions on this.

You are defining legitimate users as those that have a valid session cookie? Good luck
Maybe the API terms/cost are prohibitive? I'm sure there's some equilibrium where they would rather pay you than go through the trouble of scraping.
Maybe docs or infra are unbearable
What is your site, may I ask?

Just curious about the difference in value from using your API and web scraping as there is a cost to web scraping as well.

If you make your scraper well, and it counterfeits being a real user believably, you end up with a solution that can be tweaked as needed to handle whatever traps people put in to try to defeat your scrapers.

If you make your API client well, you don't have the problems of a scraper. But if the API owner decides to change the rules for the API and you can no longer do what your business depends on (think of the API owner as Twitter), then you need to make a scraper.

Wait, why wouldn't you have rate limiting on your API? Providers like Cloudflare offer this although I guess you could roll your own too since our industry loves to reinvent the wheel.
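
For anyone who does want to roll their own, below is a minimal sketch of one common approach, a per-client token bucket. This is an assumption about what such a limiter could look like, not how Cloudflare or any other provider implements it. The `TokenBucket` class and its parameters are hypothetical names, and state is kept in process memory, so it would not work across multiple servers as written.

```python
import time

class TokenBucket:
    """Per-client token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # client_id -> (tokens_remaining, last_seen_time)

    def allow(self, client_id):
        """Return True and spend one token if the client is under its limit."""
        now = self.clock()
        # New clients start with a full bucket.
        tokens, last = self.buckets.get(client_id, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False
```

A request handler would call `limiter.allow(api_key)` and return HTTP 429 when it gets `False`. Keying on an API key rather than an IP address is a design choice that only makes sense once clients are authenticated.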