by gnosygnu 4583 days ago
That's pretty impressive. I never had the patience to sit through a full MediaWiki import for en.wikipedia.org.

Just to be clear, XOWA isn't an installer for MediaWiki, but its own app. This allows it to avoid the dependency on the entire MediaWiki tool-chain (Apache, PHP, MySQL, MediaWiki). Unfortunately, this means that XOWA has to reproduce the same logic, which is quite a challenge...


It is indeed a challenge. The MediaWiki syntax is the weirdest mess I have ever had to parse. There is no spec, real-world usage deviates significantly from the help docs, and it's a Turing-complete language with heaps of backwards-compatibility hacks. So if you have something reasonably complete and correct, then kudos to you!
Thanks. The syntax was challenging, especially all the template syntax ("{{my_template|{{{argument1|defaultvalue|{{nested_template}}}}}}}"). Fortunately, the new Lua modules should eventually replace the template syntax, which should make it easier for future parsers.
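
For anyone who hasn't stared at that example before, here is a rough Python sketch of why the nesting is awkward: "{{" opens a template call, "{{{" opens a parameter reference, and the closing braces run together. This is only a toy under stated assumptions, not how XOWA, MediaWiki, or Parsoid actually work. The names `Node`, `parse`, and `_parse_node` are made up for illustration, and the rule "three braces always means a parameter" is a simplification; the real brace-matching rules for ambiguous runs are more involved.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Hypothetical tree node: 'template' for {{...}}, 'param' for {{{...}}}."""
    kind: str
    # Each pipe-separated part is a list of plain strings and child nodes.
    parts: List[List[Union[str, "Node"]]] = field(default_factory=list)


def _parse_node(text: str, pos: int):
    """Parse one construct starting at text[pos] == '{{'; return (node, next_pos)."""
    # Simplifying assumption: a run of three braces always opens a parameter.
    is_param = text.startswith("{{{", pos)
    kind = "param" if is_param else "template"
    closer = "}}}" if is_param else "}}"
    node = Node(kind=kind, parts=[[]])
    buf: List[str] = []

    def flush() -> None:
        # Move any buffered plain text into the current part.
        if buf:
            node.parts[-1].append("".join(buf))
            buf.clear()

    i = pos + len(closer)  # opener and closer have the same length
    while i < len(text):
        if text.startswith(closer, i):
            flush()
            return node, i + len(closer)
        if text.startswith("{{", i):
            flush()
            child, i = _parse_node(text, i)
            node.parts[-1].append(child)
        elif text[i] == "|":
            flush()
            node.parts.append([])  # start the next pipe-separated part
            i += 1
        else:
            buf.append(text[i])
            i += 1
    raise ValueError("unbalanced braces starting at offset %d" % pos)


def parse(text: str) -> List[Union[str, Node]]:
    """Parse top-level text into a list of strings and nodes."""
    out: List[Union[str, Node]] = []
    buf: List[str] = []
    i = 0
    while i < len(text):
        if text.startswith("{{", i):
            if buf:
                out.append("".join(buf))
                buf = []
            node, i = _parse_node(text, i)
            out.append(node)
        else:
            buf.append(text[i])
            i += 1
    if buf:
        out.append("".join(buf))
    return out


tree = parse("{{my_template|{{{argument1|defaultvalue|{{nested_template}}}}}}}")
```

With the example string, the seven closing braces at the end get split as 2 (inner template), 3 (parameter), and 2 (outer template), which is exactly the bookkeeping a single regex pass over the string struggles with.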
The visual editor uses a new parser, Parsoid, which has been implemented separately in node.js (iirc). That may be the answer...
Yup. It also has its own DOM, rather than continuously adding to one string and repeatedly running regexes on it (which is what MediaWiki does today).

I was already pretty far along with my own parser before Parsoid was usable, though (and my parser has its own DOM / hooks).

MediaWiki is such an astoundingly fugly piece of software.
Wouldn't it be easier to include all the original tools in a packaged form instead of reproducing their logic?
Yes, this would be the ideal approach, but it can become quite complicated (b/c the tool-chain needs to be installed for different machines). In addition, the official .xml importer (importDump.php) is not really up to the task (slow / sometimes buggy).

If you're interested in going this route, you can look at http://www.nongnu.org/wp-mirror/. This should build a local MediaWiki instance with one click. Keep in mind that it's a bit slow: it takes two days to build simple.wikipedia.org with images. In contrast, XOWA sets this up in about 30 min.