HN Mirror


by wikiburner 4833 days ago

Hey everybody, fauigerzigerk sort of gets into this, but I just downloaded the dump yesterday expecting there to be a relatively straightforward way to parse and search it with Python, and to extract and process articles of interest with NLTK.

I'm not sure what I was expecting exactly, but it sure wasn't a single 40 GB XML file that I can't even open in Notepad++.

Is my only real option (for parsing and data mining this thing) to basically set up a clone of Wikipedia's system, and then screen scrape localhost?

2 comments

raphman 4833 days ago

If you just need a plain text copy of Wikipedia: http://kopiwiki.dsd.sztaki.hu/

> We found that it is impossible to download the whole database in an easy to handle format (like HTML or plain text) and that all the available Mediawiki converters had some flaws. So we have written a Mediawiki XML dump to plain text converter, which we run every time a new database dump appears on the site and publish the text version for everybody to use.


fauigerzigerk 4833 days ago

It's not your only option. You can open the XML dump with a streaming XML parser (not a DOM parser) and use one of the existing wiki syntax parsers to extract what you need. If you just need a few specific items (for instance just the links to reconstruct the page graph or just the info boxes) that's a perfectly workable solution. There is a large number of small tools and scripts that extract various bits and pieces from the XML dump. You may well find a tool that suits your needs.
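For anyone reading along, here is a minimal sketch of the streaming approach in Python, using the standard library's `xml.etree.ElementTree.iterparse`. The helper names `local_name` and `iter_pages` are made up for this example, and it assumes the usual export layout in which each `<page>` holds a `<title>` and a `<revision>` containing `<text>`. It is an illustration, not a tested tool for the full dump.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export.

    `source` is a filename or a binary file object. Pages are handled one
    at a time, so the whole dump is never held in memory.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # discard finished pages so memory stays flat
```

Since `iterparse` also accepts a file object, a compressed dump could probably be read through `bz2.open(path, "rb")` without unpacking it first. The yielded text is still raw wiki markup, so a wiki syntax parser is needed after this step.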

But there are two issues. First, the available parsers are not very robust and not very complete, because the wiki syntax is extremely convoluted and there is no formal spec. Second, the wiki syntax includes a kind of macro system. Without actually executing those macros you don't get the complete page as you see it online. The only way to get the complete and correct page content, to my knowledge, is to install MediaWiki and import the data.
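To make the macro problem concrete, here is a rough Python sketch of what a simple extractor can do with `{{...}}` templates: drop them, tracking nesting depth. The function name `strip_templates` is invented for this example. It deliberately does not expand anything, so whatever text the template would have produced is lost, and it ignores other constructs such as triple-brace parameters.

```python
def strip_templates(wikitext):
    """Remove {{...}} spans, including nested ones, without expanding them."""
    out = []
    depth = 0  # how many {{ are currently open
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(wikitext[i])  # keep only text outside templates
            i += 1
    return "".join(out)


# The template call disappears, leaving a gap where the rendered page
# would show a value produced by the template.
print(strip_templates("The peak is {{convert|8848|m|ft}} high."))
```

The gap left in the output sentence is exactly the content that only a real MediaWiki install would fill in.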

If you just want to look at the XML dump quickly you can use less or tail.
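Where less and tail are not available, a few lines of Python can give a similar quick look at the start of the file. This is a small sketch; the function name `head` and the file name in the usage comment are placeholders, not anything from the dump itself.

```python
from itertools import islice


def head(path, n=20, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding, errors="replace") as f:
        # islice stops after n lines, so a multi-gigabyte file is fine.
        return [line.rstrip("\n") for line in islice(f, n)]


# Hypothetical usage:
# for line in head("dump.xml", 40):
#     print(line)
```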


yareally 4833 days ago

> because the wiki syntax is extremely convoluted and there is no formal spec

I ran into that when parsing out pages with Python for an app I am working on. Parsing with conditionals leads to a lot of special cases, and, as one might expect, edge cases turn up more often the more obscure the topic is, because those articles haven't been updated to bring them in line with the formatting of heavily trafficked articles. If you are looking for something in particular, ranking elements on a page helps up to a point, provided the elements you want are the ones that occur most often or close to it.

Aside from the more obscure, less trafficked articles, I noticed many of the non-English wiki articles are also formatted in awkward ways and appear to be updated far less often than their English counterparts. I thought I had most edge cases covered until I started parsing out wiki markup for other languages.


fauigerzigerk 4833 days ago

Ah, thanks for the warning. I haven't even touched the non-English articles yet :/


yareally 4833 days ago

If you plan on doing both, which is pretty easy to do with their API (you can grab all the available languages for an article along with the URLs), I would test against the foreign-language articles first and then English, once you have a basic parser and search going. Non-English had more weirdness, but because it happened more often, it became easier to eliminate similar cases in English articles, where they occur less frequently.

I ended up doing a lot of unit testing against various edge cases to make sure things were working. Even so, I would try to log any anomalies and put them aside for manual inspection later (by running checks on what "good" data should look like), just to be safe.
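As an illustration of that kind of check, here is a small Python sketch. Everything in it is an assumption made for the example: the names `find_anomalies` and `triage`, the list of leftover-markup tokens, and the minimum length. What counts as "good" data depends on the application.

```python
import logging

logger = logging.getLogger("wiki_parse")

# Tokens that should not survive in cleaned text (illustrative list only).
LEFTOVER_MARKUP = ("{{", "}}", "[[", "]]", "<ref", "{|")


def find_anomalies(plain_text, min_length=50):
    """Return a list of reasons why the extracted text looks suspicious."""
    problems = []
    if len(plain_text.strip()) < min_length:
        problems.append("too short")
    for token in LEFTOVER_MARKUP:
        if token in plain_text:
            problems.append("leftover markup: " + token)
    return problems


def triage(title, plain_text, review_queue):
    """Log suspicious pages and queue them for manual inspection.

    Returns True when the page passed every check.
    """
    problems = find_anomalies(plain_text)
    if problems:
        logger.warning("%s: %s", title, "; ".join(problems))
        review_queue.append((title, problems))
    return not problems
```

Pages that fail end up in `review_queue` rather than being dropped silently, so they can be inspected by hand later.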


ninjin 4833 days ago

> The available parsers are not very robust and not very complete, because the wiki syntax is extremely convoluted and there is no formal spec. Second, the wiki syntax includes a kind of macro system. Without actually executing those macros you don't get the complete page as you see it online. The only way to get the complete and correct page content, to my knowledge, is to install the mediawiki site and import the data.

Precisely this makes Wikipedia a pain to work with for text mining. I thought I had found a great option when I found the Freebase WEX dump [1] of Wikipedia, which is in pure XML, but it has issues of its own with duplicated text etc. due to all the silliness in the original MediaWiki markup. If I try to extract the article texts again, I may try the DBpedia long abstract dumps [2].

I am not sure what people really do when they utilise Wikipedia articles for research and applications. But I assume they just do their best and try to get the cleanest possible text out of that enormous mess of mark-up (did I mention that the specification is the implementation itself?). If anyone has a good way to get raw text without any mark-up out of Wikipedia I will gladly send you a postcard expressing my gratitude. It just makes me sad that we have an enormous resource that is well-curated and we are stuck in the mud because of a stupid engineering decision in the early history of MediaWiki.

[1]: http://wiki.freebase.com/wiki/WEX
[2]: http://wiki.dbpedia.org/Downloads38


wikiburner 4833 days ago

Hi fauigerzigerk, thanks for your response. Unfortunately, I'm going to need pretty much the entire page structure, content, links/graph of each article, so it looks like I might have to go the MediaWiki route.

I'm noticing now that the dump is available as SQL too, so maybe I'll check that out and see if a more streamlined approach is possible that way.

Also, regarding less or tail, unfortunately I'm on Windows.

Anyway, thanks again for your help, I appreciate it.
