|
|
|
|
|
by StableAlkyne
410 days ago
|
|
I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process. It's not clear which files you need, and the site itself is (or at least, was when I tried) "shipped" as some gigantic SQL scripts to rebuild the database with enough lines that the SQL servers I tried gave up reading them, requiring another script to split it up into chunks. Then when you finally do have the database, you don't have a local copy of Wikipedia. You're missing several more files, for example category information is in a separate dump. Also you need wiki software to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in. I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it. |
|
More recently they starting putting the data up on Kaggle in a format which is supposed to be easier to ingest.
https://enterprise.wikimedia.com/blog/kaggle-dataset/