
by hoppipolla 5317 days ago

Right, WebKit shipped code based on the spec first, but the spec itself was subsequently revised as the Gecko and Presto implementors found site-compatibility issues and bugs. I think the WebKit implementation was recently updated to match the spec, so we are at, or at least very close to, interoperable HTML parsing across Opera, Firefox, Safari and Chrome. I also believe that Microsoft are aiming to implement the new algorithm in IE 10.

From an interoperability point of view, the HTML parsing algorithm is the poster child for the success of the HTML effort: there is a test suite of several thousand tests [1] (also submitted to the W3C [2]) with contributions from multiple browser vendors and a number of unaffiliated individuals. Although parsing isn't sexy in the way that, say, <canvas> is, getting interoperable parsing makes it much easier to create cross-browser content (at Opera we closed a huge number of site-compatibility bugs when we landed the new algorithm).
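To give a rough feel for what those tests look like: as far as I recall, the tree-construction tests are plain-text files where each test has a "#data" section (the input markup), a "#errors" section and a "#document" section (the expected tree, one node per line prefixed with "| "). The sketch below is a simplified reader for that layout, not the real test harness. The function name parse_dat and the SAMPLE text are made up for illustration, the error line is a placeholder, and the reader ignores other sections' special handling and would misread input lines that themselves start with "#".

```python
def parse_dat(text):
    """Split html5lib-style test text into a list of test cases.

    Each case is a dict mapping a section name ("data", "errors",
    "document", ...) to that section's text. Simplified: any line
    starting with "#" is treated as a section header.
    """
    tests = []
    current = None   # lists of lines for the test being built
    section = None   # name of the section currently being read
    for line in text.split("\n"):
        if line.startswith("#"):
            name = line[1:].strip()
            if name == "data":
                # "#data" always opens a new test case
                current = {}
                tests.append(current)
            if current is None:
                continue  # header seen before any "#data"; skip it
            section = name
            current[section] = []
        elif current is not None and section is not None:
            current[section].append(line)
    # Join each section's lines and drop trailing blank lines
    return [
        {name: "\n".join(lines).rstrip("\n") for name, lines in case.items()}
        for case in tests
    ]


# Hand-written sample in the assumed layout (not copied from the suite).
SAMPLE = """#data
<p>One<p>Two
#errors
placeholder: missing doctype
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"

#data
Hello
#errors
placeholder: missing doctype
#document
| <html>
|   <head>
|   <body>
|     "Hello"
"""

cases = parse_dat(SAMPLE)
for case in cases:
    print(repr(case["data"]))
    print(case["document"])
```

A harness would then feed each case's "data" to the parser under test, serialize the resulting tree in the same "| " format, and compare it to "document". That text-in, text-out shape is what makes it practical for several vendors to share one suite.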

There are also a few open-source implementations that are not tied to browsers, e.g. for Python (and kind of also PHP) [3], for Java [4] (fun fact: the Gecko C++ implementation is generated from that Java implementation) and for JavaScript [5]. It would be great to see more conforming implementations for other languages, or to see libraries like libxml2 that have existing ad-hoc HTML parsers update their implementations to match the spec.

[1] http://code.google.com/p/html5lib/source/browse/#hg%2Ftestda...

[2] http://w3c-test.org/html/tests/submission/Opera/html5lib/

[3] http://code.google.com/p/html5lib/

[4] http://about.validator.nu/htmlparser/

[5] https://github.com/andreasgal/dom.js