Hacker News new | ask | show | jobs
by harryf 2323 days ago
Years ago I wrote the re-wrote wiki parser for Dokuwiki (which is used at https://wiki.php.net/ among other places). Originally the parser was scanning a wiki page multiple times using various regular expressions. I used a stack machine as a way to manage the regular expressions, which resulted in being able to parse a page in a single pass - it's documented here - https://www.dokuwiki.org/devel:parser

A nice (unexpected) side effect is it became much easier for people extend the parser which their own syntax, leading to an explosion of plugins ( https://www.dokuwiki.org/plugins?plugintype=1#extension__tab... )

I'm no expert on parsing theory but I have the impression that applying standard approaches to parsing source code; building syntax trees, attempting to express it with context free grammar etc. is the wrong approach for parsing wiki markup because it's context-sensitive. There's some discussion of the problem here https://www.mediawiki.org/wiki/Markup_spec#Feasibility_study

Another challenge for wiki markup, from a usability perspective, if a user get's part of the syntax of a page "wrong", you need to show them the end result so they can fix the problem, rather than have the entire page "fail" with a syntax error.

From looking at many wiki parsers before re-writing the Dokuwiki parser, what _tends_ to be the case, when people try to apply context-free grammars or build syntax trees is they reach 80% then stumble at the remain 20% of edge cases of how wiki markup is actually used in the wild.

Instead of building an object graph, the Dokuwiki parser produces a simple flat array representing the source page ( https://www.dokuwiki.org/devel:parser#token_conversion ) which I'd argue makes is simpler write code for rendering output (hence lots of plugins) as well as being more robust at handling "bad" wiki markup it might encounter in the wild - less chance of some kind of infinite recursion or similar.

Ultimately it's similar discussion to the SAX vs. DOM discussions people used to have around XML parsing ( https://stackoverflow.com/questions/6828703/what-is-the-diff... ). From a glance at the Parsiod source they seem to be taking a DOM-like approach - I wish them luck with that - my experience was this will probably lead to a great deal more complexity, especially when it comes to edge cases.

1 comments

It appears that you were probably describing the lexer pass in your description of docuwiki. Indeed tokenization is a very hard problem for wikitext. We use a pegjs grammar for it, but it contains less of lookahead/special conditions/novel extensions, etc. It's hard. Wikitext is messy precisely because it was intentionally designed to be easy and forgiving to write.

Seems like we've learned many of the same lessons building our parsers. Markup parsers do seem to be a unique thing, not really like parsing either programming languages or natural languages. If we every meet I'm sure we could happily share a beverage of your choice trading stories.