This depends on the definition of compliant. By the paper's feature-based approach, Parsoid would be 'compliant' with the PHP parser.

There is more to compliance than a simple feature comparison, though, and much of it can only be identified by large-scale testing. Each night, we have been running tests on a sample of 160k articles from 16 languages to check our progress. In this test setup, 99.99995% of articles round-trip perfectly from wikitext to HTML and back. The current focus is on visual diffing to identify the remaining rendering differences.
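
To make the round-trip idea concrete, here is a minimal sketch of such a check. It is not Parsoid's test harness: `toy_to_html`, `toy_to_wikitext` and `round_trip_rate` are hypothetical names, and the two converters only understand bold markup, purely for illustration.

```python
import re
from typing import Callable, Iterable


def toy_to_html(wikitext: str) -> str:
    # Hypothetical stand-in for a real parser: only '''bold''' is converted.
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikitext)


def toy_to_wikitext(html: str) -> str:
    # Hypothetical stand-in for a real serializer: only <b>...</b> is converted.
    return re.sub(r"<b>(.+?)</b>", r"'''\1'''", html)


def round_trip_rate(
    pages: Iterable[str],
    to_html: Callable[[str], str],
    to_wikitext: Callable[[str], str],
) -> float:
    """Fraction of pages that come back byte-identical after wikitext -> HTML -> wikitext."""
    pages = list(pages)
    if not pages:
        raise ValueError("need at least one page")
    clean = sum(1 for page in pages if to_wikitext(to_html(page)) == page)
    return clean / len(pages)


sample = [
    "'''Bold''' text",      # round-trips cleanly
    "plain text",           # round-trips cleanly
    "<b>already html</b>",  # comes back as '''already html''', so it counts as a diff
]
rate = round_trip_rate(sample, toy_to_html, toy_to_wikitext)
print(f"{rate:.2%} of sample pages round-trip without a diff")
```

The third page shows why a feature checklist is not enough: both converters "support" bold, yet the source form is lost on the way back.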

There is still a good amount of work left until Parsoid is ready to replace the PHP parser, but most of this is actually not relevant to you if all you'd like to do is extract semantic information.

Regarding the paper: the authors correctly describe some of the issues inherent in wikitext parsing. Their conclusions, however, rest on strong assumptions about the implementation strategy. For example, they do not consider the option of flattening a PEG parse tree back to tokens in order to implement the context-sensitive and generally unbalanced parts of the syntax. Similarly, the analysis of the parsing complexity seems to assume that there are no transclusion limits.
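
A rough sketch of the flattening idea, under loud assumptions: the nested-list tree shape, the token strings, and the names `flatten` and `resolve_italics` are all made up for illustration and are not Parsoid's actual data structures. The point is only that once the tree is a flat token stream, a simple stateful pass can pair up markers that the grammar itself could not balance.

```python
from typing import Iterable, Iterator, List, Union

# Assumed tree shape: a leaf is a token string, an inner node is a list of children.
Tree = Union[str, List["Tree"]]


def flatten(tree: Tree) -> Iterator[str]:
    """Walk the parse tree depth-first and yield its leaf tokens in document order."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from flatten(child)


def resolve_italics(tokens: Iterable[str]) -> List[str]:
    """Pair up '' markers on the flat stream; close any still-open italic at end of line."""
    out: List[str] = []
    is_open = False
    for tok in tokens:
        if tok == "''":
            out.append("</i>" if is_open else "<i>")
            is_open = not is_open
        elif tok == "\n":
            if is_open:
                out.append("</i>")
                is_open = False
            out.append(tok)
        else:
            out.append(tok)
    if is_open:
        out.append("</i>")
    return out


# The '' markers sit in different subtrees, so they are unbalanced as far as the tree goes.
tree: Tree = [["''", "a"], ["b", ["''", "c", "''"]], "d", "\n"]
result = "".join(resolve_italics(flatten(tree)))
print(repr(result))
```

The second pass is linear in the number of tokens and does not care how the markers were nested in the tree.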