by flaburgan
640 days ago
I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are constantly being DDoSed by bots that scrape everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them.
This is not about copyright, just about bots being stupid and the people behind them not caring at all. We need a solution to this for sure. Keeping a system online nowadays means that not only do they get your data, but you also pay for the privilege!
|
- dump availability was shaky at best back then (could see months go by without successful dumps)
- you had to fiddle with it to actually process the dumps
- you'd get the full Wikipedia content, but you didn't have the exact Wikipedia MediaWiki setup, so a bunch of things were not rendered
- you couldn't get their exact version of MediaWiki, because they added more than what was released openly
Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data, so it wasn't a focus topic, and they probably ran more bleeding-edge versions of MediaWiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.
At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some, because it becomes easier for them to get the same data through a channel you provide.
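
To give an idea of what "processing the dump" can look like when you only need the raw page text, here is a minimal Python sketch. It assumes the dump is a bz2-compressed file in the MediaWiki XML export format (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>`), and the names `iter_pages` and `iter_dump` are made up for illustration. It yields raw wikitext only, so it does nothing for the rendering problems listed above.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(fileobj):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Assumption: each <page> has one <title> and at least one
    <revision><text>; if a page has several revisions, the last one wins.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


def iter_dump(path):
    """Stream pages from a bz2-compressed dump file (path is a placeholder)."""
    with bz2.open(path, "rb") as f:
        yield from iter_pages(f)
```

Something like `for title, text in iter_dump("dump.xml.bz2"): ...` then walks the whole dataset locally, after a single download, without touching the live servers.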