by flaburgan 640 days ago
I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are constantly under DDoS from bots that scrape everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them. This is not about copyright, just about bots being stupid and the people behind them not caring at all. We definitely need a solution to this. Keeping a system online nowadays means that not only do they get your data, but you also pay for it!

To be fair, some 20 years ago when I wanted to do something with Wikipedia data, I scraped them too, after having tried quite a bit to use the dumps.

- dump availability was shaky at best back then (could see months go by without successful dumps)

- you had to fiddle with it to actually process the dumps

- you'd get the full wikipedia content, but you didn't have the exact wikipedia mediawiki setup, so a bunch of things were not rendered

- you couldn't get their exact version of mediawiki, because they added more than what was released openly

Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data so it wasn't a focus topic, and they probably ran more bleeding edge versions of mediawiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.

At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some, because it becomes easier for them to get the same data through a channel you provide.

Can relate. I've used their dumps, and one task was to generate a paragraph summary. The dumps themselves use wiki markup, which obviously adds an entirely new level of complexity. There are dumps of "summaries" but they're fairly broken, seemingly due to an ever-evolving wiki markup syntax. I believe there are other ways to parse them though, which involve downloading a bunch of other people's code.

So if someone were to scrape the front end for the first paragraph element or whatever, it may make their life easier.

I’ve just taken to blocking entire swaths of cloud service IP networks. I don’t care what the intentions are; my personal sites don’t have the infinite bandwidth needed to put up with thousands of poorly written spiders.
Is there a public list of those address blocks that you'd recommend?
Not that I know of, but each service seems to publish a list (some in text, some JSON). I’ll reply later with the URLs of the ones I have.
This is what I have, see another reply for shared IP lists:

  https://ip-ranges.amazonaws.com/ip-ranges.json

  https://www.digitalocean.com/geo/google.csv
  
  https://www.gstatic.com/ipranges/cloud.json
I also found this but haven't validated it yet: https://github.com/femueller/cloud-ip-ranges
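
For what it's worth, here is a rough Python sketch of turning lists like those into a lookup. The JSON field names (`ip_prefix`, `ipv6_prefix`, `ipv4Prefix`, `ipv6Prefix`) and the "CIDR in the first CSV column" layout are my reading of those files, so check them against the live data before relying on this. The function names are made up for illustration, and the sample data uses documentation-only ranges, not real cloud blocks.

```python
import csv
import io
import ipaddress
import json


def networks_from_aws_json(text):
    """Parse an AWS-style document (assumed keys: prefixes / ipv6_prefixes)."""
    data = json.loads(text)
    cidrs = [p["ip_prefix"] for p in data.get("prefixes", [])]
    cidrs += [p["ipv6_prefix"] for p in data.get("ipv6_prefixes", [])]
    return [ipaddress.ip_network(c) for c in cidrs]


def networks_from_gcp_json(text):
    """Parse a Google-style document (assumed keys: ipv4Prefix / ipv6Prefix)."""
    data = json.loads(text)
    nets = []
    for p in data.get("prefixes", []):
        cidr = p.get("ipv4Prefix") or p.get("ipv6Prefix")
        if cidr:
            nets.append(ipaddress.ip_network(cidr))
    return nets


def networks_from_csv(text):
    """Parse a CSV whose first column is assumed to be a CIDR block."""
    nets = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        try:
            nets.append(ipaddress.ip_network(row[0].strip()))
        except ValueError:
            # Skip header rows or anything that is not a valid network.
            continue
    return nets


def is_blocked(ip, networks):
    """Return True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Tiny inline samples in the assumed formats (documentation ranges only).
SAMPLE_AWS = '{"prefixes": [{"ip_prefix": "192.0.2.0/24"}], "ipv6_prefixes": [{"ipv6_prefix": "2001:db8::/32"}]}'
SAMPLE_GCP = '{"prefixes": [{"ipv4Prefix": "198.51.100.0/24"}]}'
SAMPLE_CSV = "203.0.113.0/24,US,US-NY,New York,10011\n"

BLOCKED = (
    networks_from_aws_json(SAMPLE_AWS)
    + networks_from_gcp_json(SAMPLE_GCP)
    + networks_from_csv(SAMPLE_CSV)
)
```

In practice you would download each file on a schedule, feed the body to the matching parser, and call `is_blocked(client_ip, BLOCKED)` per request. A linear scan is fine for a small personal site; with many thousands of prefixes you would probably want to export the ranges to firewall rules instead.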
Set up a honeypot, or more like a booby trap, and boldly ban all IPs that access it.

Then you can consider banning OVH, DO, AWS, GCP, Oracle, China, Russia.
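
A minimal sketch of the booby-trap idea in Python, assuming a trap URL that no legitimate visitor would ever request (for instance one that is only reachable through a hidden link and is disallowed in robots.txt). The class name, the trap path, and the choice of returning 403 are all hypothetical; a real setup would more likely push the banned IPs into the firewall than keep them in memory.

```python
class HoneypotBanList:
    """In-memory ban list: any client that touches a trap path gets banned."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.banned = set()

    def check(self, ip, path):
        """Return the HTTP status to answer with: 403 if banned, else 200."""
        if ip in self.banned:
            return 403
        if path in self.trap_paths:
            # First hit on the trap: remember the address and refuse it.
            self.banned.add(ip)
            return 403
        return 200


# Hypothetical trap path; pick something your real pages never link to visibly.
bans = HoneypotBanList(["/trap-do-not-follow/"])
```

Calling `bans.check(client_ip, request_path)` at the top of a request handler is enough to try it out. Note that banning by IP will also hit anyone else sharing that address, such as VPN or carrier-NAT users.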

Honeypot is a good idea, but not for my immediate little one-server Web site startup.

On blocking country address ranges, my idealist side hopes that doesn't prove necessary. I personally know nice people in both of those countries.

It's just an inevitability due to poor abuse handling (or lack thereof) in those countries.

Some people might be nice but it's a minuscule part of the absolute flood of malicious traffic originating from those countries.

If people in those countries do not like such treatment, I'm so sorry, but they should force their ISPs to clean up their act. It's insane.

I use a VPN when bittorrent is running, and I've found that several websites outright block me "for security reasons." They like to show me my IP address, too, like a great secret has been revealed and the SWAT team is on their way.