by scarygliders 1691 days ago
Right with you there.

I had a particularly bad time not so long ago, when a customer's site - a shop - was brought to its knees because someone, probably a competitor, hired some scraper-company of some sort to scrape every product and price.

The scraper would systematically go through every single product page.

And by scraper, I mean - 100s of them. All at the same time, using the old trick of each scraper requesting 3 or 4 product pages at a time then pausing for a while.

They used umpteen different IP address blocks from all over the globe - but mainly OVH VPS IP address blocks from France.

Now, maybe if they'd just thrown, say, 5 or 10 of the scraper "units" at the site, no one would have noticed in amongst the Googlebot traffic (which the shop wanted anyway because they are using Google Shopping to try to bring in more sales).

But no. This shower of arseholes threw 100s of scraper "tasks" at the site. They got greedy.

Now, the site was robust enough to handle this massive load - barely. But having to do that /and/ also handle normal day-to-day traffic? Nah. The bastards got greedy and, like you, I spent a few days unfucking the damage they were causing.

Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.


> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Not everybody in this space is out to destroy your site. Some of us actively try to put as little load on your site as possible. My scraper puts less load on sites than I do when I browse them normally, I've measured it. Really sucks when we get lumped together with the other abusers and blocked.

Exactly, some of us use scrapers because while we can't go full Richard Stallman, we also don't want to visually sift through ridiculous UI just to look at some basic data/text.

> we also don't want to visually sift through ridiculous UI just to look at some basic data/text

Yeah.

First scraper I ever built was for my school portal. Absolutely atrocious user interface. It got to the point that I seriously hated that site so I built a script to log into it and download my information. I just wanted to see my grades without suffering.

In a past life, we were consulting with a startup that offered a subscription data service. They were very sensitive about scrapers, especially on the time limited try-before-you-buy accounts, which competitors were abusing.

At their request, we built a method to flag accounts for data poisoning. Once flagged, those accounts would start getting plausible-ish looking garbage data.

It was pretty effective. One competitor went offline for a few days about a week after that started, and had a more limited offering when they came back up.
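
For anyone wondering what that kind of poisoning could look like, here is a minimal sketch, not the actual system described above. The account IDs, the `poison_price` name, and the 5-25% skew are all assumptions made for illustration. The one design point it tries to show is that the fake value is derived from the account and item, so a flagged account sees the same plausible number on every request and has no easy way to notice the switch.

```python
import hashlib
import random

# Hypothetical set of accounts that have been flagged as scrapers.
FLAGGED_ACCOUNTS = {"trial-4711"}


def poison_price(account_id: str, sku: str, real_price: float) -> float:
    """Return the real price, or a stable plausible fake for flagged accounts."""
    if account_id not in FLAGGED_ACCOUNTS:
        return real_price
    # Seed the RNG from account + item so repeated requests get the same fake value.
    digest = hashlib.sha256(f"{account_id}:{sku}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Skew the price by 5-25% in either direction (assumed range).
    factor = 1 + rng.choice([-1, 1]) * rng.uniform(0.05, 0.25)
    return round(real_price * factor, 2)
```

Normal accounts are untouched, so the cost of a false negative is just one more scraper getting real data for a while.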

That's a good way of going about dealing with this kind of abuse indeed. Wish I'd thought of doing that at the time, but due to the nature of this shop you didn't need a user account to browse the products/prices.

I'm now making an entirely new shop for them - I shall bear this in mind. Thanks for that!

Yea. Detecting them and messing with them is the only approach that seems to work for a lot of abusive activity. Banning doesn’t work because they will just start over from scratch. The only thing you can really do is make them think you haven’t “caught” them yet and during that stretch make sure their time is wasted.

It sucks when this happens, but it's easily avoidable by using a caching frontend of some sort.

My favorite is Varnish,[0] which I have used with great success for _many_ web sites throughout the years. Even a web site that served 10+ million requests per day ran from a single web server for a long time, a decade-ish ago.

[0] https://varnish-cache.org/
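
To illustrate why a caching frontend helps here, a rough sketch of the idea in Python. This is not Varnish configuration and not how Varnish is implemented; the `TTLCache` class, the `render` callback, and the 60-second TTL are assumptions for the example. The point is only that repeated requests for the same product page within the TTL never reach the expensive backend.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve a stored copy of a page until it is older than ttl_seconds."""

    def __init__(self, render: Callable[[str], str],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._render = render      # expensive backend call (hypothetical)
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        hit = self._store.get(path)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh copy: backend is not touched
        body = self._render(path)  # miss or stale: render once and store
        self._store[path] = (now, body)
        return body
```

With something like this in front, a few hundred scrapers walking the catalogue cost roughly one backend render per page per TTL, rather than one per request.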

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Wait till you find out what half of Google's business is based on (spoiler - scraping).

I really don't think scraping itself is an issue 90% of the time. It's the behavior of the out-of-control scrapers that is the problem. A well-behaved scraper should barely be noticeable, if at all.
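
As one possible reading of "well behaved", here is a small throttle sketch that enforces a minimum gap between requests. The `Throttle` class and the interval are assumptions for illustration, and this is only one piece of politeness among several.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last request was allowed

    def wait(self) -> float:
        """Sleep if needed; return how many seconds were waited."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` before every request keeps a single scraper at a fixed, low request rate. Checking the site's robots.txt first (the standard library has `urllib.robotparser` for that) and sending an identifiable User-Agent are the other obvious courtesies.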

At least Google's scraping does result in your website being discoverable by users, so you get something out of it. That's not to say Google never misuses or steals the data they scrape, but at least there is some benefit. Many other scrapers are merely taking the data to compete.

I strongly feel that if a human can get to it manually, we have to accept that either it will be botted or humans will be paid to do it by hand (they call these people "analysts" or "market researchers").

I might argue that what Google actually uses their scraped data for is their search engine - which is private. They simply allow us access to specially crafted queries, which they can and do manipulate (for many reasons, some good some bad).

The only thing I'd say meets that definition would be like Common Crawl.

Exactly. I am surprised that the 'devs' can't figure out a way to block only annoying/excessive scrapers. Most likely they are just lazy, so they put in a 3rd party 'solution' and call the job done. Pay me.
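
A sliding-window limiter is one simple way to block only the excessive clients. The sketch below is an assumption-laden example, not a complete defence: the `SlidingWindowLimiter` name, the limits, and the choice of key are all hypothetical. Keying on a single IP will not catch a scraper spread thinly over hundreds of addresses, so the key might need to be a subnet or hosting provider instead.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self._max = max_requests
        self._window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False           # over the limit: reject, do not record
        hits.append(now)
        return True
```

A client that stays under the limit never notices it, which is the whole point of blocking only the excessive ones.
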

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

if a scraper is effectively DDoSing you, call it what it is -- a denial of service attack.

i've found from experience that most scraping attempts originate against host-sites that are generally user-hostile; no APIs to use, JS tricks to bother user browsing, or groups that profit from first-mover advantage and thus try to obscure data.

So, if your sites are commonly the victim of scrapers that are harvesting publicly available data i've found that it's more useful to ask myself what alternatives I could provide those that feel the need to scrape.

As for a 'lack of ethics' on how publicly available data is wrangled -- well, i'll just say that I feel that it remains the responsibility of the administrator rather than being something to push the blame onto clients for. There are plenty of technical avenues to pursue before appealing to morals and ethics for help.

This and the post you are replying to both sound like sabotage by a competitor rather than legit data collecting.

If your site is so poorly written it can't handle a few hundred computers trying to do something as simple as loading your product pages then sorry, but that's on you. The information is on the public web and scrapers are as entitled to access it as any web browser.