by wikiburner
4833 days ago
Hey everybody, fauigerzigerk sort of gets into this, but I just downloaded the dump yesterday expecting there to be a relatively straightforward way to parse and search it with Python, then extract and process articles of interest with NLTK. I'm not sure what I was expecting exactly, but it sure wasn't a single 40 GB XML file that I can't even open in Notepad++. Is my only real option (for parsing and data mining this thing) to basically set up a clone of Wikipedia's system, and then screen-scrape localhost?
> We found that it is impossible to download the whole database in an easy to handle format (like HTML or plain text) and that all the available Mediawiki converters had some flaws. So we have written a Mediawiki XML dump to plain text converter, which we run every time a new database dump appears on the site and publish the text version for everybody to use.
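
For what it's worth, a file that size doesn't have to be opened whole; it can be read as a stream, one article at a time. Below is a minimal sketch of that idea using the standard library's `xml.etree.ElementTree.iterparse`. The function names `iter_pages` and `local_name`, and the tiny `SAMPLE` document, are made up for illustration. It assumes the dump follows the usual MediaWiki export layout (`<page>` elements holding a `<title>` and a `<revision>` with a `<text>`), and it yields raw wikitext, not cleaned plain text, so markup would still need stripping before handing anything to NLTK.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that export files attach to every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs, holding only one page in memory at a time.

    `source` is a filename or a binary file object. Assumes one revision per
    page; with a full-history dump this would yield the last revision seen.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # discard finished pages so memory stays flat


# Tiny stand-in for the real dump (hypothetical content, same element layout).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<siteinfo><sitename>Wikipedia</sitename></siteinfo>"
    b"<page><title>A</title><revision><text>alpha</text></revision></page>"
    b"<page><title>B</title><revision><text>beta [[link]]</text></revision></page>"
    b"</mediawiki>"
)

if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(wikitext))
```

For the real thing you would pass the path of the dump file instead of the `BytesIO` object. If you kept the compressed download, passing `bz2.open(path, "rb")` should let it stream without unpacking to disk first, at the cost of some speed.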