by fauigerzigerk
4834 days ago
It's not your only option. You can open the XML dump with a streaming XML parser (not a DOM parser) and use one of the existing wiki syntax parsers to extract what you need. If you just need a few specific items (for instance, just the links to reconstruct the page graph, or just the infoboxes), that's a perfectly workable solution. There are a large number of small tools and scripts that extract various bits and pieces from the XML dump, and you may well find one that suits your needs.

But there are two issues. First, the available parsers are not very robust and not very complete, because the wiki syntax is extremely convoluted and there is no formal spec. Second, the wiki syntax includes a kind of macro system, and without actually executing those macros you don't get the complete page as you see it online. The only way to get the complete and correct page content, to my knowledge, is to install MediaWiki and import the data.

If you just want to look at the XML dump quickly, you can use less or tail.
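
For the streaming approach, here is a rough sketch in Python using only the standard library. It assumes the usual dump layout of `<page>` elements containing a `<title>` and a `<text>`, and it pulls out link targets with a deliberately naive regex rather than a real wiki syntax parser, so it will miss or mangle the edge cases mentioned above. The function names (`iter_pages`, `extract_links`, `page_graph`) are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Naive pattern: captures the target of [[Target]], [[Target|label]] and
# [[Target#Section]]. It does not understand templates or nested markup.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs one page at a time.

    `source` is a filename or a binary file object. The dump is never
    loaded as a whole tree: finished pages are discarded as we go.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            root.clear()  # drop the pages parsed so far from memory


def extract_links(wikitext):
    """Return link targets found in the wikitext, in order of appearance."""
    return [target.strip() for target in LINK_RE.findall(wikitext)]


def page_graph(source):
    """Map each page title to the list of link targets found on that page."""
    return {title: extract_links(text) for title, text in iter_pages(source)}
```

Calling `page_graph` on the path of an uncompressed dump would build the whole link graph in memory, which may be too much for a full Wikipedia dump; for that you would more likely loop over `iter_pages` and write the edges out as you go.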
|
I ran into that when parsing out pages with Python for an app I am working on. Parsing by hand-written conditions leads to a lot of special cases for edge cases, and, as one might expect, these come up more often the more obscure the topic gets, because those articles have not been updated or brought in line with the formatting of heavily trafficked articles. If you are looking for something in particular, ranking the elements on a page helps up to a point, provided the elements you want are the ones that occur most often or close to it.
Aside from the more obscure, less trafficked articles, I noticed that many of the non-English wiki articles are also formatted in awkward ways and appear far less up to date than their English counterparts. I thought I had most edge cases covered until I started parsing wiki markup for other languages.
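
One possible reading of the ranking idea above, sketched in Python: count how often each template name appears across a set of pages and sort by frequency, so the common forms can be handled first and the long tail of oddities inspected later. This is only an illustration and not necessarily what the app described here does. The regex and the name `rank_templates` are assumptions, and the normalisation assumes the common MediaWiki setup in which the first letter of a template name is case-insensitive.

```python
import re
from collections import Counter

# Captures the name right after '{{', up to the first '|' or the closing '}}'.
# Parser functions such as {{#if: ...}} are not treated specially.
TEMPLATE_NAME_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?=\||\}\})")


def normalise(name):
    """Collapse whitespace and upper-case the first letter of a template name."""
    name = " ".join(name.split())
    return name[:1].upper() + name[1:]


def rank_templates(pages):
    """Return (template name, count) pairs, most frequent first.

    `pages` is any iterable of wikitext strings.
    """
    counts = Counter()
    for wikitext in pages:
        for raw in TEMPLATE_NAME_RE.findall(wikitext):
            counts[normalise(raw)] += 1
    return counts.most_common()
```

Running the same ranking per language edition would show where the template names and layouts diverge from the English ones.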