Hacker News
by ebbp 1690 days ago
Having spent a week battling a particularly inconsiderate scraping attempt, I’m quite unsurprised by the juvenile tone and fairly glib approach to the ethics of bots/scraping presented by the piece.

For the site I work for, about 20-30% of our monthly hosting costs go towards servicing bot/scraping traffic. We’ve generally priced this into the cost of doing business, as we’ve prioritised making our site as freely accessible as possible.

But after this week, where some amateur did real damage to us with a ham-fisted attempt to scrape too much too quickly, we’re forced to degrade the experience for ALL users by introducing captchas and other techniques we’d really rather not.

8 comments

Right with you there.

I had a particularly bad time not so long ago, when a customer's site - a shop - was brought to its knees because someone, probably a competitor, hired some scraper-company of some sort to scrape every product and price.

The scraper would systematically go through every single product page.

And by scraper, I mean - 100's of them. All at the same time, using the old trick of 1 scraper requesting 3 or 4 product pages at a time then pausing for a while.

They used umpteen different IP address blocks from all over the globe - but mainly OVH VPS IP address blocks from France.

Now, maybe if they'd just thrown, say, 5 or 10 of the scraper "units" at the site, no one would have noticed in amongst Googlebot (which they wanted to use anyway because they are using Google Shopping to try to bring in more sales).

But no. This shower of arseholes threw 100's of scraper "tasks" at the site. They got greedy.

Now, the site was robust enough to handle this load - barely - and it was massive. But having to do that /and/ also handle normal day-to-day traffic? Nah. The bastards got greedy and, like you, I spent a few days unfucking the damage they were causing.

Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Not everybody in this space is out to destroy your site. Some of us actively try to put as little load on your site as possible. My scraper puts less load on sites than I do when I browse them normally, I've measured it. Really sucks when we get lumped together with the other abusers and blocked.

Exactly, some of us use scrapers because while we can't go full Richard Stallman, we also don't want to visually sift through ridiculous UI just to look at some basic data/text.
> we also don't want to visually sift through ridiculous UI just to look at some basic data/text

Yeah.

First scraper I ever built was for my school portal. Absolutely atrocious user interface. It got to the point that I seriously hated that site so I built a script to log into it and download my information. I just wanted to see my grades without suffering.

In a past life, we were consulting with a startup that offered a subscription data service. They were very sensitive about scrapers, especially on the time limited try-before-you-buy accounts, which competitors were abusing.

At their request, we built a method to flag accounts for data poisoning. Once flagged, those accounts would start getting plausible-ish looking garbage data.

It was pretty effective. One competitor went offline for a few days about a week after that started, and had a more limited offering when they came back up.
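
For anyone wondering what that could look like, here is a minimal Python sketch of the flag-and-poison idea. Everything in it is an assumption for illustration (the `FLAGGED` set, the `serve_record` helper, the 5-15% distortion); the comment above doesn't say how the real system worked.

```python
import hashlib
import random

# Hypothetical set of account IDs flagged as suspected scrapers.
FLAGGED = {"trial-1042"}


def poison(value: float, account_id: str, key: str) -> float:
    """Return a plausible-looking but wrong value, stable per account and key."""
    # Seed from account + key so the same request always yields the same
    # garbage; inconsistent answers would tip off the scraper.
    seed = hashlib.sha256(f"{account_id}:{key}".encode()).digest()
    rng = random.Random(seed)
    # Shift the real value by 5-15% in a random direction (assumed range).
    factor = 1 + rng.choice([-1, 1]) * rng.uniform(0.05, 0.15)
    return round(value * factor, 2)


def serve_record(account_id: str, key: str, real_value: float) -> float:
    """Serve real data to normal accounts, poisoned data to flagged ones."""
    if account_id in FLAGGED:
        return poison(real_value, account_id, key)
    return real_value
```

Seeding from the account and key keeps the garbage consistent across repeat requests, which is what makes it "plausible-ish" rather than obviously random.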

That's a good way of going about dealing with this kind of abuse indeed. Wish I'd thought of doing that at the time, but due to the nature of this shop you didn't need a user account to browse the products/prices.

I'm now making an entirely new shop for them - I shall bear this in mind. Thanks for that!

Yea. Detect them and mess with them is the only approach that seems to work for a lot of abusive activity. Banning doesn’t work because they will just start over from scratch. The only thing you can really do is make them think you haven’t “caught” them yet and during that stretch make sure their time is wasted.
It sucks when this happens, but it's easily avoidable by using a caching frontend of some sort.

My favorite is Varnish,[0] which I have used with great success for _many_ web sites throughout the years. Even a web site that served 10+ million requests per day ran from a single web server for a long time, a decade-ish ago.

[0] https://varnish-cache.org/
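
To show the idea rather than any real Varnish configuration, here is a rough Python sketch of what a caching layer does in front of an expensive backend. The `TTLCache` class, the `render` callback and the 60-second TTL are all made up for the example; Varnish itself is configured in VCL and does far more than this.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache rendered pages for a short time so repeat hits skip the backend."""

    def __init__(self, render: Callable[[str], str], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.render = render      # the expensive backend call (hypothetical)
        self.ttl = ttl            # seconds a cached page stays fresh (assumed)
        self.clock = clock
        self.store: Dict[str, Tuple[float, str]] = {}
        self.backend_hits = 0

    def get(self, path: str) -> str:
        now = self.clock()
        cached = self.store.get(path)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]      # fresh copy: the backend is not touched
        self.backend_hits += 1
        body = self.render(path)
        self.store[path] = (now, body)
        return body
```

With something like this in front, a thousand scraper hits on the same product page inside one TTL window cost the backend a single render.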

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Wait till you find out what half of Google's business is based on (spoiler - scraping).

I really don't think scraping itself is an issue 90% of the time. It's the behavior of the out of control scrapers that are the problem. A well behaved scraper should barely be noticeable, if at all.

At least Google's scraping does result in your website being discoverable by users. So you get something out of it. That's not to say Google never misuses or steals the data they scrape. But at least there is some benefit. Many other scrapers are merely taking the data to compete.
I strongly feel that if a human can get to it manually, we have to accept that either it will be botted or humans will be paid to do it by hand (They call these people "analysts" or "market researchers").

I might argue that what google actually uses their scraped data for is their search engine - which is private. They simply allow us access to specially crafted queries, which they can and do manipulate (for many reasons, some good some bad).

The only thing I'd say meets that definition would be like Common Crawl.

Exactly. I am surprised that the 'devs' can't figure out a way to block only annoying/excessive scrapers. Most likely they are just lazy and then just put in a 3rd party 'solution' and job done. Pay me.
>Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

if a scraper is effectively DDoSing you, call it what it is -- a denial of service attack.

i've found from experience that most scraping attempts originate against host-sites that are generally user-hostile; no APIs to use, JS tricks to bother user browsing, or groups that profit from first-mover advantage and thus try to obscure data.

So, if your sites are commonly the victim of scrapers that are harvesting publicly available data i've found that it's more useful to ask myself what alternatives I could provide those that feel the need to scrape.

As for a 'lack of ethics' on how publicly available data is wrangled -- well, i'll just say that I feel that it remains the responsibility of the administrator rather than being something to push the blame onto clients for. There are plenty of technical avenues to pursue before appealing to morals and ethics for help.

This and the post you are replying to both sound like sabotage by a competitor rather than legit data collecting.
If your site is so poorly written it can't handle a few hundred computers trying to do something as simple as loading your product pages then sorry, but that's on you. The information is on the public web and scrapers are as entitled to access it as any web browser.
Bots are one of those things that are easy to build and hard to get right, and there's really no way of preparing for the chaotic reality of real web pages other than fixing the problems as they show up. Weird and unexpected interactions are going to happen. Crawling the real web involves navigating a fractal of unexpected, undocumented and non-standard corner cases. Nobody gets that right on the first try. Because of that I do think we need to be a bit patient with bots.

At the same time, even as someone who runs a web crawler, I have zero qualms about blocking misbehaving bots.

I kinda feel like rate limiting your requests to individual domains and IP addresses is an easy thing that goes a long way towards getting it right.
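
As a rough illustration of per-domain rate limiting on the crawler side, here is a small Python sketch. The `HostThrottle` name and the one-second default interval are assumptions, not anyone's actual crawler; it only keys on hostname, so limiting by IP address would need a DNS lookup on top.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval   # seconds between hits (assumed)
        self.clock = clock
        self.sleep = sleep
        self.next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's host may be hit again; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        delay = max(0.0, self.next_allowed.get(host, now) - now)
        if delay:
            self.sleep(delay)
        # The next request to this host must wait a full interval from now.
        self.next_allowed[host] = now + delay + self.min_interval
        return delay
```

Calling `throttle.wait(url)` before every request keeps each host at or below one request per interval while leaving other hosts unaffected.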
There are still snags with that.

Stuff like redirect resolution is very easy to overlook. You may think you're fetching 1 URL per second, but if you are using the wrong tool and you're on a server that has you bouncing around like in a pinball machine and takes you through a dozen redirects for every request, the reality may be closer to 10 requests per second.
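
One way to avoid that trap is to follow redirects by hand and count every hop as a request. The Python sketch below assumes a hypothetical `fetch` callable that performs one request without following redirects and returns the status code and `Location` header; the `max_hops` limit of 5 is arbitrary.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher: takes a URL, returns (status_code, Location or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_counting_hops(url: str, fetch: Fetch,
                        max_hops: int = 5) -> Tuple[str, int, int]:
    """Follow redirects by hand; return (final_url, status, requests_made)."""
    requests_made = 0
    while True:
        status, location = fetch(url)
        requests_made += 1            # every hop is a real request to a server
        if status not in REDIRECT_CODES or location is None:
            return url, status, requests_made
        if requests_made > max_hops:
            return url, status, requests_made   # give up on long chains
        url = urljoin(url, location)  # Location may be a relative URL
```

The returned `requests_made` is the number that should be charged against the rate limit, not the number of URLs the crawler thinks it asked for.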

On top of that, sometimes the same server has multiple domains. Sometimes the same IP-address serves a large number of servers (maybe it's a CDN).

If you build your site in a way that multiplies each request 10x, well then that's what you get. Don't do that and you won't have issues with requests. Or handle those requests properly. There are solutions to that. You know how many requests your local Google CDN gets? They know how to manage load.
Most pages have at least a http->https redirect, many contain a lot of old links to http content.

Usually it's error pages that really drive the large redirect chains. They often have a vibe of like some forgotten stopgap put in place to help with some migration to a version of the site that is no longer in existence.

Of course you don't know it's an error page until you reach the end of the redirect chain.

If an amateur can do that to your service by scraping, imagine what someone can do if they actually intend to do you harm. With cloud pricing models someone could find a little misconfiguration or oversight and put you in the hole in operating costs. Anti-abuse is a necessary design when your service is exposed to the internet.

Not saying that doesn't suck - it does, it's why many ideas don't work in practice as an online service.

I'm right there with you. I'm the lead engineer for an automotive SaaS provider (with ~6000 customers and ~4 billion requests per month) and we recently started moving all our services to Cloudflare's WAF to take advantage of their bot protection. We were getting scrapes from botnets in the 100000+ per minute range that were affecting performance.

We chose to switch to the JS challenge screen as it requires no human interaction. We now block 75% (estimated to the best of our knowledge) of bot traffic but some customers are livid over the challenge screen.

I'm really surprised that the JS challenges helped so much, given that there are open source libraries for bypassing them (e.g. cloudscraper[0]).

[0]: https://github.com/venomous/cloudscraper

If someone wanted to get past it they probably could. We've had a few sources of traffic that we've had to straight up block (as opposed to challenge) because of this exact issue. So far it's been a "good enough" solution that blocks enough of the bot traffic to be effective.
What were they scraping, if I can ask? Was it targeted or just wget -r style?
It was a hybrid of low-effort vulnerability scanning and targeted inventory scraping. Many dealerships in the automotive space will pay gray-hat third parties to scrape and compile data on their competitors.

The irony for us as a provider is that it's one of our customers (party A) paying a third party to scrape data from another one of our customers (party B) which in turn affects the performance of party A's site. We've started blocking these third parties and directing them to paid APIs that we offer.

And how do you get your 'inventory data'? Aren't you scraping (or using scraped data) yourself? Oh the irony :)
No, we're a contracted provider for these customers. They ingest their data into our network through APIs or CSVs.
Makes little sense - customers upload data to you and they don't want any data back? Really?
Why do you think those bots were scraping your data in the first place?
Why not create an API endpoint and charge a modest fee for that data? You’ll make money instead of spending it.
Do you honestly believe all site scraper people/companies are ethical enough to go to whoever pays /them/ to scrape data from a competitor's site and say "oh they offer an API to access this data let's pay for that", instead of "why pay for that data when we can scrape it right off their site"?

Also, not all types of company will provide API endpoints. It all depends on the type of site - for example, an online shop might not wish to provide easily accessible data on offered products and prices, to their competitors who may wish to undercut them. Why would an online shop do that?

I run a large scraper farm against several large sites. They're not online shops, and we don't compete with them. But they do have hundreds of thousands of data points that we use to provide reports and analytics for our clients, who also do not compete with the sites.

I absolutely would pay for an API that provides that data. I'd be willing to pay 10x more than the cost of maintaining and running the scrapers.

But the sites being scraped have no interest in that.

Have you tried approaching those sites and asking them to provide an API, pointing out that it would be easier for both of you in the long run? Or are you just assuming they wouldn't do it.

Because right now the bots - which comprise probably 2/3 of my traffic - are causing me huge headaches, and I sure wish that the people doing it would tell me what the heck they want.

Yes, we have. And no, they are not interested.
Building and maintaining the scraper is not the cost they would use to measure it internally. It’s the cost to build the API and support it, plus perhaps any perverse incentive it creates where even more data flows out to competitors.
For all intents and purposes, this isn't competitive data for them. There aren't really competitors in the space anyway, the barrier to entry is ridiculous. In fact, by law, operators in the industry are required to share this particular data with each other and industry regulators. But they don't share it with outside parties in the aggregate form we need it in. Hence, the scraping.
Building an API is 5 times easier than building routes for your public webpages, which are basically an 'API' as well.
And the cost of being scraped.
Well, you don't need an api, just a CSV file with a catalog.

The scraping company WILL use the API/CSV file... they will probably also still charge their customer for scraping, so it's a win-win :D

You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.

You can be principled and insist on blocking bots and spend a lot of time and money on tools, people, and ultimately hosting because the bots will always win; or you can offer the data for free/minimal fee and serve it with almost zero cost and cache it so you can do that with a micro sized server.
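
A minimal Python sketch of the "just a CSV file" approach, assuming a hypothetical product record with `sku`, `name` and `price` fields. The output is plain text that could be written to a static file and served from a cache.

```python
import csv
import io
from typing import Iterable, Mapping


def export_catalog(products: Iterable[Mapping[str, object]]) -> str:
    """Render the public catalog as CSV text, ready to write to a static file."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["sku", "name", "price"],  # only fields already public
        extrasaction="ignore",                # silently drop internal fields
        lineterminator="\n",
    )
    writer.writeheader()
    for product in products:
        writer.writerow(product)
    return buf.getvalue()
```

Regenerating the file on a schedule means a scraper's full catalog pull costs one static file transfer instead of one page render per product.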

You can always lie about some of the prices if you want, but you will just encourage bots again.

Ethics are nice, but let's be honest, very lacking. Sometimes it's better to be pragmatic.

> You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.

There's the problem right there. The prices and product data are publicly visible - because there is a target audience of /humans/ for whom the site is designed and by whom it is intended to be used. The site is not there to cater for a competitor's scrapers.

I don't care how much people couch their unethical behaviour in "the data is publicly available", the basic fact is most if not all websites exist for human eyeballs to look at them. They do not exist for arseholes to DOS them by inundating them with scrapers.

From my perspective, the problem is that the data that is offered isn't really "for humans". The data is for convincing the humans to buy/pay or worse, browse and watch ads as a result.

But overall, information is one of those goods that has intrinsic properties like no other. It can be copied, infinitely. And we haven't yet figured out the dynamics of how to reason about it, so it feels like we're pretending they're physical goods.

Edit. Side note. I'd go further and say that some of the data is even worse, it's "offered" with the real intention being to confuse the users into performing non-optimally in the market. Look at Amazon/Ebay/AliExpress/Google listings for evidence of that. Just Google - Google is an ML and scraping powerhouse, and the best they can muster is to be spammed with fake websites and duplicate/confusing listings.

You hit the nail on the head. It's hard to have sympathy for site operators complaining about scraping, where almost every site does its best[0] to make using it a time consuming, potentially risky and overall annoying ordeal. Not to mention, information asymmetry is anathema to a well-functioning market, and yet no. 1 reason for fighting bots given in the whole thread here is a desire to maintain that information asymmetry.

And that's also the dirty secret behind the "attention economy": its whole point is to make things as inefficient as possible, because if you're making money on people's attention, you need to first steal it (by distracting them from what they're trying to achieve), and then either direct it towards your goals (vs. those of the users), or stretch it out to maximize their exposure to advertising.

--

[0] - Sometimes unintentionally. Unfortunately, the overall zeitgeist of UX design is heavily influenced by bad players, so default advice in the industry is often already intrinsically user-hostile.

> the basic fact is most if not all websites exist for human eyeballs to look at them.

There's a whole ethical subthread here of websites trying to make the experience for those humans miserable, and taking away the agency necessary to protect oneself from that. A browser is a user agent. So is a screen reader. So is a script one writes to not deal with bullshit fluff, when all one wants is a simple table of products, features and prices.

I agree 100%, but it is a fact of life, and sometimes it's better to just minimize the fuss and focus on the things that matter.

Your argument is perfectly valid and applies to offline activities as well (what stops a competitor from walking through the aisles of a Walmart or Costco?), but this is a battle that can't be won, there are too many parasitic actors. It is human nature.

Understanding your competitor's pricing is not "parasitic", it's research. Every company I've ever worked for that sells something online scrapes their competitors in some way (whether with bots or with interns).
> (what stops a competitor from walking through the aisles of a Walmart or Costco?)

That's a significant portion of Nielsen's business model.

Let's not encourage these unethical people to even think of using human eyeballs and manual data entry for their scraping instead of bots. That sounds pretty darn unethical.
> Why would an online shop do that?

Because otherwise the HTML will become the API.

Ethical - of course not. Practical.

Valuable public data is going to be scraped - this is inevitable. Even paywalled or signup protected valuable data is going to be scraped.

Why not sell valuable data for reasonable price then.

My point was more that we can accept, and live with, scrapers, but we expect some minimal level of consideration if you’re going to abuse our very expensively gathered dataset. Sending us 10x daily traffic so you can scrape quicker than the fair usage policy of our API allows is just… poor etiquette? Unkind? Not really sure how to phrase it. I’m exhausted after multiple 18-hour days trying to keep our website online for the public.
As a programmer who just sometimes wants to check if a given item is available in store, I would like to be able to use an API for that. But if one is not available, one has to scrape.
>where some amateur did real damage to us

If an amateur can do damage to you, then I have some bad news for you...

This is nonsense. It's always easier to destroy than to build/maintain. If you got any real advice, by all means...
> If an amateur can do damage to you, then I have some bad news for you...

I believe the point wasn't surprise that damage occurred at all, but frustration that damage can occur just out of laziness/ignorance rather than malice.

Indeed, that was precisely their point, and "bad news for you" is disingenuous as there are many techniques used by incompetent, or just downright unethical and greedy scraper companies which, no matter how robust the target is, can still give it a major headache.

I've witnessed a site being basically DOS'ed due to particularly greedy and aggressive mass scraping attempts.

Precisely this, thank you.
To be clear, the “damage” they did was to our bottom line. Most sites don’t capacity plan for random cliff walls of 2-10x traffic (clearly we should!). We’re scalable enough to handle that traffic after a period, but a) it caused intermittent periods of low availability (costing us money because we didn’t generate income the way we normally do) and b) it cost us money from scaling all our services up.

It’s just selfish. If you’re going to take the product of other people’s work in a manner they don’t consent to, at least do it in a way that doesn’t cost them twice over.

Considering the demand for your content, why haven’t you created and provided an API? Maybe you could monetize?
I wrote a scraper a couple of years ago to get a single data point from a website where my client was already a paying customer. This website had an API, which they were also paying for, but the API didn't cover that data point, so at the time they had one of their admin people populating that missing piece of data manually, which was taking them around ten minutes a day.

I asked them if my customer could pay to access this data point via their API and they quoted 3600 EUR/month! Enter the scraper...

We do offer an API - the scrapers are trying to circumvent using that, presumably.
Why do you think they are trying to circumvent it?

Does your API provide all the information that can be found on the site, or are they scraping because the API is incomplete?

We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

This is the number one reason to scrape websites. It's always nice when there's an API with documentation and rate limiting rules you can follow. Sometimes the data I need just isn't there, though. Then I open up their site and find a huge amount of private API endpoints that do exactly what I want. Then I open up a ticket about it and it gets 200 replies but they ignore it for years. It's fucking stupid and it's really no wonder people scrape their site.
Why would Amazon wish to provide you with easy to access data on their products and prices when you could either be a competitor wishing to undercut those prices, or be a scraper company hired by such a competitor?

In what universe is providing such a straightforward way of helping a competitor considered sane business practice?

Most sellers who are on the Amazon platform give Amazon that information and a lot more, knowing full well Amazon will use their sales data to launch an Amazon Basics competitor.

It is a sane business approach when you are a pragmatic business who knows the limits that constrain your business.

Either the content company is going to build a simple API (could be just a static CSV file hosted on S3 or whatever) with useful information, or try to monetize/hide this information and force scrapers to use the website.

A bot is always going to win unless you want to impose a lot of friction on users as well. In the era of deepfakes and fairly robust AI tooling, the difference between bot action and human action is not all that much.

If you are going to be aggressive with captchas, IP blocks and other fingerprinting, users who get falsely identified as bots, or just annoyed, will leave.

When the cost of losing those users is more than the cost of allowing access to scrapers, you would absolutely set up the API.

Man your comment is hilarious because in fact Amazon DOES provide an API for exactly that
And yet...

> We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

...only a couple of comments up.

Because they will get the data regardless of what you do and if you don't make an API it will cost you more due to overhead.
Markets are competitive and efficient when all parties have full information. If Amazon doesn't want its prices to be known and finds ways to successfully prevent them from being scraped, in some sense the state should force it to disclose them via API (or something equivalent).
In the end, they still get the data, just in a much less desirable way for both you and the customer.
Is it not viable to put the majority of your data behind a login, so that the bots only get a very limited snapshot while legitimate users get it through a free login?

I’m asking this because I’m going through very similar situation and would love to see other opinions around this.

You are defining legitimate users as those that have a valid session cookie? Good luck
Maybe the API terms/cost are prohibitive? I'm sure there's some equilibrium where they would rather pay you than go through the trouble of scraping.
Maybe docs or infra are unbearable
What is your site may I ask?

Just curious about the difference in value from using your API and web scraping as there is a cost to web scraping as well.

If you make your scraper well, and it counterfeits being a real user believably, you end up with a solution that can be tweaked as needed to handle whatever traps people put in to try to defeat your scrapers.

If you make your api client well, you don't have the problems of a scraper - but if the api owner decides to change rules for api and you can't do what your business is based on being able to do (think of api owner as Twitter) then you need to make a scraper.

Wait, why wouldn't you have rate limiting on your API? Providers like Cloudflare offer this although I guess you could roll your own too since our industry loves to reinvent the wheel.
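
For the roll-your-own case, a token bucket is one common way to do it. This is only a sketch: the `TokenBucket` class and its `rate`/`burst` parameters are made up for the example and say nothing about how Cloudflare or the site in question implements limits.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Per-client token bucket: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # client key -> (tokens remaining, time of last update)
        self.state: Dict[str, Tuple[float, float]] = {}

    def allow(self, client: str) -> bool:
        """Return True if this client's request may proceed right now."""
        now = self.clock()
        tokens, last = self.state.get(client, (float(self.burst), now))
        # Refill for the time elapsed since the last call, capped at burst.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[client] = (tokens - 1.0, now)
            return True
        self.state[client] = (tokens, now)
        return False
```

A request handler would call `allow(api_key)` and answer with HTTP 429 when it returns `False`, so a client that goes over the limit gets a clear signal instead of degrading the service for everyone.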
Like everyone and their brother has a web spider. And some of them are VERY badly designed. We block them when they use too many resources, although we'd rather just let them be.

Can't speak for the op but we have APIs and move the ones scraping and reselling our content to APIs. The majority are just a worthless suck on resources though.