
by ninjin 4834 days ago

> The available parsers are not very robust and not very complete, because the wiki syntax is extremely convoluted and there is no formal spec. Second, the wiki syntax includes a kind of macro system. Without actually executing those macros you don't get the complete page as you see it online. The only way to get the complete and correct page content, to my knowledge, is to install the mediawiki site and import the data.

Precisely this makes Wikipedia a pain to work with for text mining. I thought I had found a great option in the Freebase WEX dump [1] of Wikipedia, which is pure XML, but it has issues of its own, such as duplicated text, due to all the silliness in the original MediaWiki markup. If I try to extract the article texts again, I may try the DBpedia long abstract dumps [2].

I am not sure what people really do when they use Wikipedia articles for research and applications, but I assume they just do their best and try to get the cleanest possible text out of that enormous mess of markup (did I mention that the specification is the implementation itself?). If anyone has a good way to get raw text without any markup out of Wikipedia, I will gladly send you a postcard expressing my gratitude. It just makes me sad that we have an enormous, well-curated resource and are stuck in the mud because of a stupid engineering decision in the early history of MediaWiki.
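For what "doing their best" might look like in practice, here is a rough sketch of a regex-based stripper. It is an illustration under loud assumptions, not a real parser: the function name `strip_wikitext` and the set of rules are made up for this example, it only covers a handful of common constructs (comments, `<ref>` footnotes, templates, internal links, bold/italic quotes), and it throws templates away instead of expanding them, so any text a template would have generated is lost. That is exactly the limitation described in the quote above.

```python
import re


def strip_wikitext(text: str) -> str:
    """Best-effort removal of common MediaWiki markup (hypothetical helper, not a parser)."""
    # Drop HTML comments and <ref>...</ref> footnotes, including self-closing refs.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>/]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>]*/>", "", text)

    # Remove templates ({{...}}) innermost-first so that nesting is handled.
    # Assumption: templates are discarded, not expanded.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while True:
        text, count = template.subn("", text)
        if count == 0:
            break

    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)

    # Bold/italic markers are runs of two or more apostrophes.
    text = re.sub(r"'{2,}", "", text)

    # Any remaining HTML-like tags.
    text = re.sub(r"<[^>]+>", "", text)

    # Collapse runs of spaces and tabs left behind by the removals.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = (
        "'''Oslo''' is the capital of [[Norway]].{{citation needed}} "
        "It lies on the [[Oslofjord|Oslo Fjord]].<ref>Some source</ref>"
    )
    print(strip_wikitext(sample))
```

Tables, image links with nested captions, parser functions and everything else in the long tail are not handled, so the output still needs checking by eye on real articles.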

[1]: http://wiki.freebase.com/wiki/WEX

[2]: http://wiki.dbpedia.org/Downloads38