by gwicke 4258 days ago
Nothing beats being told that what you just did is in fact not feasible ;) Check out this page:

https://www.mediawiki.org/wiki/Parsoid

This is the bidirectional conversion engine between wikitext and HTML+RDFa that powers VisualEditor and several other tools. It tracks source range to DOM structure correspondence as proposed in the post. At this point, the IDE is basically a user interface and performance problem. The conversion is readily available through a REST interface, but on the largest articles parsing from modified wikitext to HTML can take around 10 seconds. Most of that time is spent in the expansion of the myriad of citation templates that we like people to add. It is possible to speed this up to something more usable for an IDE, but it's not trivial.

You might also be interested in this blog post about some of the challenges we encountered while building Parsoid: http://blog.wikimedia.org/2013/03/04/parsoid-how-wikipedia-c...


Hello! OP here. I'm totally ignorant of Parsoid but I respectfully suggest that bidirectional lossless conversion is not possible in general.

First, it relies on the function Wikitext->HTML being injective. But isn't it trivial to create two different Wikitexts that compile to the same HTML? Whitespace is just the start of this story.

Second, apparently the template language is Turing-complete. Let's say I write a prime sieve in order to generate a page that lists the first 100 prime numbers. What would it then mean to edit "31" to change it to "30"?

(With apologies for not yet having read the things you kindly linked to.)

> First, it relies on the function Wikitext->HTML being injective. But isn't it trivial to create two different Wikitexts that compile to the same HTML? Whitespace is just the start of this story.

Yes, and Parsoid works around this by preserving some metadata about wikitext in HTML (such as information about whitespace around syntax elements) and, since this preservation isn't perfect, only reserializing HTML→wikitext where the content was changed during editing.
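To make the idea concrete, here is a minimal sketch of selective serialization. It is not Parsoid's code: the `Node` class, the toy `parse` and `serialize` functions, and the two-construct "wikitext" (headings and paragraphs) are all assumptions made up for illustration. Each node remembers the source range it came from, and only nodes flagged as modified are regenerated, so whitespace variants that a full serializer would normalize away survive untouched.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    kind: str              # "h2" or "p" (toy subset, hypothetical)
    text: str              # normalized content, as an editor would see it
    start: int             # source range [start, end) in the original wikitext
    end: int
    modified: bool = False


def parse(wikitext: str) -> list[Node]:
    """Toy line-based parser that records a source range for every node."""
    nodes, pos = [], 0
    for line in wikitext.splitlines(keepends=True):
        body = line.rstrip("\n")
        m = re.fullmatch(r"==\s*(.*?)\s*==\s*", body)
        if m:
            nodes.append(Node("h2", m.group(1), pos, pos + len(line)))
        else:
            # Collapsing whitespace makes this step lossy (non-injective).
            nodes.append(Node("p", " ".join(body.split()), pos, pos + len(line)))
        pos += len(line)
    return nodes


def serialize(node: Node) -> str:
    """Normalizing serializer: always emits one canonical spelling."""
    if node.kind == "h2":
        return f"== {node.text} ==\n"
    return node.text + "\n"


def selective_serialize(original: str, nodes: list[Node]) -> str:
    """Reuse the original source for unmodified nodes, regenerate the rest."""
    return "".join(
        serialize(n) if n.modified else original[n.start:n.end] for n in nodes
    )


src = "==Intro==\nSome   text here.\n==  Other  ==\n"
nodes = parse(src)

# A full re-serialization normalizes whitespace and produces a dirty diff.
full = "".join(serialize(n) for n in nodes)

# Editing one paragraph only touches that paragraph's source range.
nodes[1].text = "Some new text."
nodes[1].modified = True
edited = selective_serialize(src, nodes)
```

In this toy, `full` differs from `src` even though nothing was edited, while `edited` keeps both headings byte-for-byte as the author wrote them.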

> Second, apparently the template language is Turing-complete. Let's say I write a prime sieve in order to generate a page that lists the first 100 prime numbers. What would it then mean to edit "31" to change it to "30"?

Assuming we're talking about VisualEditor, you currently just can't do that (you can only delete the entire template inclusion and replace it with normal text, or edit template parameters).

(As a nitpick, wikitext is not Turing-complete: there is no loop or recursion construct, and recursion is explicitly checked for and causes an error, so you can only write complicated algorithms by manually unrolling enough loops. However, for some time now you have also been able to write templates in Lua, which is a proper Turing-complete programming language; see <https://www.mediawiki.org/wiki/Extension:Scribunto>.)
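A rough sketch of what such a recursion check can look like, assuming a toy template syntax of bare `{{name}}` inclusions without parameters. The `expand` function, `TemplateLoopError`, and the `templates` dictionary are hypothetical names for illustration and do not reflect how MediaWiki actually implements the check:

```python
import re


class TemplateLoopError(Exception):
    """Raised when a template ends up including itself, directly or indirectly."""


# Toy inclusion syntax: {{name}} with no parameters (an assumption).
INCLUDE = re.compile(r"\{\{(\w+)\}\}")


def expand(text: str, templates: dict[str, str], stack: tuple[str, ...] = ()) -> str:
    """Expand inclusions, tracking the chain of templates currently being expanded."""

    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in stack:
            # The template is already on the expansion stack: that is a loop.
            raise TemplateLoopError(" -> ".join(stack + (name,)))
        return expand(templates[name], templates, stack + (name,))

    return INCLUDE.sub(repl, text)


result = expand("{{a}}{{b}}", {"a": "x{{b}}", "b": "y"})
```

Including the same template twice side by side is fine here; only a template that reappears inside its own expansion is rejected.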

> Second, apparently the template language is Turing-complete. Let's say I write a prime sieve in order to generate a page that lists the first 100 prime numbers

You can't. Using recursion (even primitive) is discouraged, and there are even automated safeguards that make it hard (though I don't remember exactly what they are).

But I agree with you in general: making a visual editor that deals with templates correctly is not an easy task.

You and I know that it was just barely feasible, and if only MW had started with a parser rather than a series of regular expressions, we'd have had a visual editor in 2005 ...

Sure, it claims to do this, but I'm still a bit skeptical that this is actually compliant. What do you make of the paper I linked to? That blog post only glosses over things like context-sensitivity.

This depends on the definition of compliant. By the paper's feature-based approach, Parsoid would be 'compliant' with the PHP parser.

There is more to compliance than a simple feature comparison though, most of which can only be identified by large-scale testing. Each night, we have been running tests on a sample of 160k articles from 16 languages to check our progress. In this test setup, 99.99995% of articles round-trip perfectly from wikitext to HTML and back. Currently the focus is on visual diffing to identify remaining rendering differences.
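The shape of such a round-trip check can be sketched as follows. This is not our actual test setup: `round_trip_rate` and the two toy converters are hypothetical stand-ins, and a real run would call the actual wikitext→HTML and HTML→wikitext conversions instead.

```python
from typing import Callable, Iterable


def round_trip_rate(
    articles: Iterable[str],
    to_html: Callable[[str], str],
    to_wikitext: Callable[[str], str],
) -> tuple[float, list[str]]:
    """Return the fraction of articles that survive wikitext -> HTML -> wikitext
    unchanged, together with the sources that did not."""
    total = 0
    failures = []
    for src in articles:
        total += 1
        if to_wikitext(to_html(src)) != src:
            failures.append(src)
    rate = (total - len(failures)) / total if total else 1.0
    return rate, failures


# Toy converters (assumptions): the HTML step collapses whitespace, so any
# source containing a double space cannot round-trip without extra metadata.
def toy_to_html(src: str) -> str:
    return "<p>" + " ".join(src.split()) + "</p>"


def toy_to_wikitext(html: str) -> str:
    return html[len("<p>"):-len("</p>")]


rate, failures = round_trip_rate(["a b", "a  b", "c"], toy_to_html, toy_to_wikitext)
```

Keeping the failing sources, not just the rate, is what makes a nightly run useful: the failures are the regression cases to look at next.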

There is still a good amount of work left until Parsoid is ready to replace the PHP parser, but most of this is actually not relevant to you if all you'd like to do is extract semantic information.

Regarding the paper: The authors correctly describe some of the issues inherent in wikitext parsing. The paper's conclusions, however, rest on strong assumptions about the implementation strategy. For example, the authors do not consider the option of flattening a PEG parse tree back to tokens in order to implement context-sensitive and generally unbalanced parts of the syntax. Similarly, the analysis of the parsing complexity seems to assume a lack of transclusion limits.