Hacker News new | ask | show | jobs
by cscottnet 2321 days ago
I'm on the team. Part 2 of this post series should have lots of interesting technical details for y'all; be patient, I'm still writing it.

But to whet your appetite: we used https://github.com/cscott/js2php to generate a "crappy first draft" of the PHP code for our JS source. Not going for correctness, instead trying to match code style and syntax changes so that we could more easily review git diffs from the crappy first draft to the "working" version, and concentrate attention on the important bits, not the boring syntax-change-y parts.

The original legacy Mediawiki parser used a big pile of regexps and had all sorts of corner cases caused by the particular order in which the regexps were applied, etc.

Parsoid uses a PEG tokenizer, written with pegjs (we wrote a PHP backend to pegjs for this project). There are still a bunch of regexps scattered throughout the code, because they are still very useful for text processing and a valuable feature of both JavaScript and PHP as programming languages, but they are not the primary parsing mechanism. Translating the regexps was actually one of the more difficult parts, because there are some subtle differences between JS and PHP regexps.

We made a deliberate choice to switch from JS-style loose typing to strict typing in the PHP port. Whatever you may consider the long term merits are for maintainability, programming-in-the-large, etc, they were extremely useful for the porting project itself, since they caught a bunch of non-obvious problems where the types of things were slightly different in PHP and JS. JS used anonymous objects all over the place; we used PHP associative arrays for many of these places, but found it very worthwhile to take the time to create proper typed classes during the translation where possible; it really helped clarify the interfaces and, again, catch a lot of subtle impedance mismatches during the port.

We tried to narrow scope by not converting every loose interface or anonymous object to a type -- we actually converted as many things as possible to proper JS classes in the "pregame" before the port, but the important thing was to get the port done and complete as quickly as possible. We'll be continuing to tighten the type system -- as much for code documentation as anything else -- as we address code debt moving forward.

AMA, although I don't check hacker news frequently so I can't promise to reply.

2 comments

One question: did you investigate why the PHP version was so much faster than the JS one ? Do you think the performance gains of the PHP versions could be achieved in JS, or do you use any special feature of the PHP interpreter ?
No, we haven't investigated it yet since we haven't had the time to do it. But, we've filed a task for maybe someone to look at it ( https://phabricator.wikimedia.org/T241968 ), but our hunch is that it would be be incorrect to conclude that PHP is faster from JS.
A slightly longer answer is that we looked into a number of possible reasons and it's not clear there's an easy answer. Lots of differences between the two setups, and every time we come up with an possible answer like "oh, it's the reduced network API latency" we come up with a counter like "but html2wt is also faster and it does barely any network requests". Casual investigation raises more questions than answers. So we've looked into it but don't yet have an answer that we fully believe.
If you want to dig through the history some: https://github.com/wikimedia/parsoid/blame/6eb00df3e090b20cc... Is a pretty good example of the porting technique. You'll see quite a decent number of lines are still unchanged from the "automatic conversion from JS". https://github.com/wikimedia/parsoid/commit/6eb00df3e090b20c... shows what the initial port process was like. Still quite a bit of work, but you'll see it's almost all "real" work that needs a human to think about things, not just mechanical syntax translation. The syntax translation part was done automatically.

Then https://github.com/wikimedia/parsoid/commits/master/src/Ext/... is a not-too-atypical view of the process after the "intial working port" was done (post Aug 2019). Some nasty bugs fixed (https://github.com/wikimedia/parsoid/commit/34fcb4241aa0f3a0... a GC bug in PHP!), some more subtle bugs (PHP's crazy behavior of '$' at the end of a regexp, unless you use the 'D' flag), etc.

If you look through the history earlier in 2019, you'll even see JS commits like https://github.com/wikimedia/parsoid/commit/2853a90ceda7cdfa... which are to the JS code (in production at the time) preparing the way for the PHP port. In that particular case, our tooling was doing offset conversion between JS UTF-16 and PHP UTF-8 as part of the output-testing-and-comparison QA framework we'd built for the port, and it was getting hugely confused by Gallery since Gallery was using "bogus" offsets into the source text. Since fixing the offsets was rather involved (the patchset for this commit in gerrit went through 56 revisions : https://gerrit.wikimedia.org/r/505319 ) the change was first done on the JS side, thoroughly tested, and deployed to production to ensure it had no inadvertent effects, before that now-better JS code was ported to PHP. It would have been a disaster to try to make this change in the PHP version directly during the port.