Hacker News
by luckylion 637 days ago
To be fair, some 20 years ago when I wanted to do something with Wikipedia data, I scraped them too, after having tried quite a bit to use the dumps.

- dump availability was shaky at best back then (could see months go by without successful dumps)

- you had to fiddle with it to actually process the dumps

- you'd get the full Wikipedia content, but not Wikipedia's exact MediaWiki setup, so a bunch of things didn't render

- you couldn't get their exact version of MediaWiki, because they added more than what was released openly

Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data, so it wasn't a focus, and they probably ran more bleeding-edge versions of MediaWiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.

At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and how to import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some, because it's easier for them to get the same data through a channel you provide.
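
For a sense of what "fiddling with the dumps" can look like today, here is a minimal sketch that streams (title, wikitext) pairs out of a MediaWiki XML export without loading the whole file. Assumptions, none of them from the comment above: the export uses the usual page/title/revision/text layout, the sample document and the names `iter_pages` and `local` are made up for illustration, and a real dump would be opened with something like `bz2.open(path, "rb")` instead of an in-memory buffer. Note that it only gets you raw wiki markup; the rendering problem described above is untouched.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-page export, just to exercise the parser.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Perl</title>
    <revision><text>'''Perl''' is a [[programming language]].</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in an export stream."""
    title, text = None, None
    # "end" events fire once an element is fully parsed, so <title> and
    # <text> are seen before the enclosing <page> closes.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the page's children to keep memory flat


if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(page_title, len(wikitext))
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since the export schema version in the namespace URI changes over time.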

1 comment

Can relate. I've used their dumps, and one task was to generate a paragraph summary. The dumps themselves use wiki markup, which obviously adds an entirely new level of complexity. There are dumps of "summaries" but they're fairly broken, seemingly due to an ever-evolving wiki markup syntax. I believe there are other ways to parse them though, which involve downloading a bunch of other people's code.

So if someone were to scrape the front end for the first paragraph element or whatever, it may make their life easier.
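
To show why that route is tempting, here is a rough sketch of the "first paragraph element" approach using only Python's standard library. It assumes you already have the rendered HTML as a string (fetching is left out), the sample markup is invented, and `FirstParagraph` and `first_paragraph` are hypothetical names. It takes the text of the first non-empty `<p>` and ignores everything else.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.parts = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Start collecting only while no paragraph has been accepted yet.
        if tag == "p" and self.result is None:
            self.in_p = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            # Collapse runs of whitespace; skip paragraphs that end up empty.
            text = " ".join("".join(self.parts).split())
            if text:
                self.result = text

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)


def first_paragraph(html):
    """Return the first non-empty paragraph's text, or None if there is none."""
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.result


# Invented sample: an empty paragraph, then the one we want, then another.
SAMPLE_HTML = (
    '<div><p class="empty"> </p>'
    '<p><b>Perl</b> is a <a href="/wiki/X">programming language</a>.</p>'
    "<p>Second.</p></div>"
)

if __name__ == "__main__":
    print(first_paragraph(SAMPLE_HTML))
```

That is a couple dozen lines with no third-party code, versus parsing wiki markup, which is roughly the imbalance the parent comment describes.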