by taliesinb 5088 days ago
For anyone who is interested, I've written a parallelized Wikipedia spidering tool in Go: https://github.com/taliesinb/wikispider

It's for when you want to grab a small portion of the full Wikipedia graph without cutting yourself on the 30-odd gigabytes of XML the dumps provide.
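
To make the idea concrete, here is a rough sketch of a depth-limited, parallel crawl of a link graph. It is not wikispider's actual code. `Crawl` and `LinkFunc` are hypothetical names, and the link fetcher is a stand-in for whatever really retrieves a page's outgoing links.

```go
package main

import (
	"fmt"
	"sync"
)

// LinkFunc returns the titles linked from a page. Hypothetical stand-in
// for a real fetcher (for example, an HTTP call to a wiki API).
type LinkFunc func(title string) []string

// Crawl explores the link graph breadth-first from start, expanding at most
// maxDepth levels, and fetches every page of a level concurrently.
// It returns the outgoing links of each page it expanded.
func Crawl(start string, maxDepth int, links LinkFunc) map[string][]string {
	graph := make(map[string][]string)
	seen := map[string]bool{start: true}
	frontier := []string{start}

	for depth := 0; depth < maxDepth && len(frontier) > 0; depth++ {
		results := make([][]string, len(frontier))
		var wg sync.WaitGroup
		for i, title := range frontier {
			wg.Add(1)
			go func(i int, title string) {
				defer wg.Done()
				results[i] = links(title) // each goroutine writes only its own slot
			}(i, title)
		}
		wg.Wait()

		// Merge results on a single goroutine, so the maps need no locking.
		var next []string
		for i, title := range frontier {
			graph[title] = results[i]
			for _, l := range results[i] {
				if !seen[l] {
					seen[l] = true
					next = append(next, l)
				}
			}
		}
		frontier = next
	}
	return graph
}

func main() {
	// Toy graph used in place of real Wikipedia pages.
	toy := map[string][]string{
		"Go":          {"Google", "Concurrency"},
		"Google":      {"Go"},
		"Concurrency": {"Thread"},
		"Thread":      {"Process"},
	}
	g := Crawl("Go", 2, func(t string) []string { return toy[t] })
	fmt.Println("pages expanded:", len(g))
}
```

The depth limit is what keeps the result to a small sub-graph. A real crawler would probably also cap the number of goroutines in flight and rate-limit its requests.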

1 comment

Does it crawl revision history as well? I can deal with the gigabytes - it's the terabytes that scare me! (i.e., full revision dumps).
So you want a small sub-graph of Wikipedia, but you want the full revision history of each of the articles in that sub-graph? I don't think the MediaWiki API makes it possible to get the full revision history of an article as a single object, so you're probably better off operating on a dump.