Hacker News
by echelon 2323 days ago
I'm actually curious why PHP was chosen instead of Rust or Go given that the parsing team wasn't familiar with the language. I understand that MediaWiki is written in PHP, but it sounds like they were already comfortable with language heterogeneity.

They claim,

> The two wikitext engines were different in terms of implementation language, fundamental architecture, and modeling of wikitext semantics (how they represented the "meaning" of wikitext). These differences impacted development of new features as well as the conversation around the evolution of wikitext and templating in our projects. While the differences in implementation language and architecture were the most obvious and talked-about issues, this last concern -- platform evolution -- is no less important, and has motivated the careful and deliberate way we have approached integration of the two engines.

Which is, I suppose, a compelling reason for a rewrite if you're understaffed.

I'd still be interested in writing it in Rust and then writing PHP bindings. There's even a possibility of running a WASM engine in the browser and skipping the roundtrip for evaluation.

3 comments

> I'm actually curious why PHP was chosen

From the article: "However, by 2015, as VisualEditor and Parsoid matured and became established, maintaining two parallel wikitext engines in perpetuity was untenable"

They didn't write it in PHP for speed, that was merely a side effect. They wrote it in PHP so they could have a single language for the system.

> Parsoid/PHP also brings us one step closer to integrating Parsoid and other MediaWiki wikitext-handling code into a single system, which will be easier to maintain and extend.

I assume that Wikimedia works on a rather tight budget. Choosing (and unifying on) tech stacks with a larger supply in devs seems to be an economically reasonable choice.

It's more complicated than that. MediaWiki is PHP based because back when it was developed PHP was everywhere. Since then the world has moved on, but PHP still powers a huge percentage of the web via things like WordPress.

The other side to using PHP was having support in other host providers. Wikipedia is not the only installation of MediaWiki and there has been consideration in the past for those installing MediaWiki on shared hosts where you don't necessarily have root access to install things like node. Moving forward that's less of a concern because you can containerise MediaWiki (and the other services), but not even Wikimedia run that in production yet AFAIK.

However, even if they weren't budget constrained (which they aren't), unifying on a single language used by the majority of their devs isn't a bad idea, especially when the effort to port the entire stack to a new language would be unjustifiable.

and... migrating an entire codebase to something new because there's a subset of devs that jump between tech stacks and want 'newer' stuff isn't an economically reasonable choice.

server-side JS was a thing 10 years ago, but it didn't offer enough benefits to switch. same with python, java, ruby - all existed, but didn't offer enough benefits to switch then, and probably still don't now.

also, what would be a "larger supply"? C? Java? C#? JS? PHP has a huge supply of developers at all skill levels, which may make it just as easy (or easier) in finding the talent they need. And... hey - they wrote that initial parsoid in JS and... they've doubled the speed by converging on PHP.

Huh. What does the ED of the Wikimedia Foundation even do?
Wikimedia is swimming in donations. More than $100,000,000 yearly since 2017/2018.
The parser is probably a bunch of regexes that no one understands, so they just converted the code to PHP without touching the expressions.
My suspicion is correct: the code is full of things like: /\[\[([^\[\]])\]\]|\{\{([^\{\}])\}\}|-\{([^\{\}]*)\}-/

I never understood why people find regex so intimidating. Obviously you probably didn't look to find the worst of all, but the one you posted is very straightforward.

You jest, but that regex looks machine-generated. My Emacs is full of these in places used for syntax coloring, but I know these are optimized. There's an elisp function, regexp-opt, into which you can throw a bunch of strings, and you get out a regex like the one above.

To be honest, I was serious. Personally, I believe regular expressions are one of the few tools that are super useful even for people outside of IT, because everyone has to extract or format some text or table data from time to time. You can even learn them just by playing a game:

https://regexcrossword.com/

The example quoted required some mental work to unparse, so I assumed you were joking.

But in general, I agree with you. Regular expressions aren't hard, and there's no excuse for not learning to read and use them.

Regex are dreaded as difficult to comprehend, but the real danger in using them is more subtle - especially nowadays when you'd have most text as UTF-8, possibly escaped, etc. and regex are prone to misbehave in odd ways, and introduce security issues - they should only be handled by expert programmers. Even parsing apparently simple stuff like email addresses, IP addresses, phone numbers and date/time is tricky, far beyond what a newbie would expect. There's a reason we have dedicated validation functions in PHP for all of the above. That said, regex have their use case too, and if your parsing case is not covered by a dedicated function, are usually the best option.
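
To make the point about dedicated validators concrete, here is a minimal sketch. The comment is about PHP (presumably functions such as `filter_var`); Python's standard `ipaddress` module is used here purely as an analogous illustration, and the helper names `naive_is_ipv4` and `strict_is_ipv4` are made up for this example. It shows a plausible-looking regex accepting inputs that a real validator rejects.

```python
import ipaddress
import re

# A "looks right" pattern a newcomer might write: four dot-separated
# groups of one to three digits. It does not check the 0-255 range.
NAIVE_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def naive_is_ipv4(text: str) -> bool:
    """Regex-only check (hypothetical helper, for illustration)."""
    return NAIVE_IPV4.match(text) is not None


def strict_is_ipv4(text: str) -> bool:
    """Check using the standard library's dedicated parser."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # The last sample has a trailing newline, which `$` quietly tolerates.
    for sample in ["192.168.0.1", "999.999.999.999", "192.168.0.1\n"]:
        print(repr(sample), naive_is_ipv4(sample), strict_is_ipv4(sample))
```

The regex can of course be tightened, but each fix (octet range, trailing newline, non-ASCII digits) makes it longer and harder to review, which is roughly the argument for reaching for the dedicated function first.
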
I never understood why people who understand regex don’t understand people who don’t understand regex. Obviously you are not the worst of all, but it’s not that hard to imagine how a regex looks to someone who doesn’t know regex, is it?
I couldn't agree more. I know regex fairly well and parsing regex is still annoying and takes a lot more concentration than just reading normal code.

Plus there are so many cases where people build insane regex where they are just the wrong tool for the job, e.g. parsing/extracting or manipulating HTML. It always starts out with "I just need the src from that <img>, what could go wrong" and ends in despair, because you never just need that src and you never only deal with perfect html and you'd be done already if you had just used some dom parser.
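
The "I just need the src from that <img>" scenario can be sketched as follows. This is only an illustration under assumptions: it uses Python's standard `html.parser` rather than any particular DOM library, and the names `NAIVE_SRC`, `ImgSrcCollector`, `img_sources` and the sample markup are invented for the example.

```python
import re
from html.parser import HTMLParser

# The regex most people start with: assumes lowercase tag, src as the
# first attribute, and double quotes.
NAIVE_SRC = re.compile(r'<img src="([^"]*)"')


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html: str) -> list[str]:
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Three images: the "perfect" form, an uppercase self-closing tag with
# single quotes, and a tag split across lines with an unquoted value.
SAMPLE = """<p><img src="a.png"><IMG alt="x" SRC='b.png'/><img
  class="c" src=c.png></p>"""

if __name__ == "__main__":
    print(NAIVE_SRC.findall(SAMPLE))
    print(img_sources(SAMPLE))
```

On this sample the regex finds only the first image, while the parser finds all three without any special-casing.
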

Yeah, I get that regular expressions might look complex and tangled, like brainfuck looks to me since I never tried to learn it. Yet I keep seeing comments about how regular expressions are hard to understand from all kinds of IT people who solve puzzles a hundred times more complex every day. I guess it's just a reputation that sticks to a certain technology and really has nothing to do with actual complexity.

Experience, I guess. I've spent hundreds of hours debugging and fixing regexes that other people wrote, usually just to find there's a quirk in a certain regex parser implementation.

Regexes are easy to understand if you write them, but reading them can take lots of time.

Note that HN formatting messed it up (there are stars missing before the first two closing parens). The regex itself is indeed quite straightforward, just a bit hard to read due to all the required backslash-escaping.
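
For readers who want to see the difference, here is a small sketch of the regex with the two stars restored as described above, split into commented pieces. It is written in Python rather than PHP, the names `AS_POSTED`, `RESTORED` and `classify` are invented for the example, and the labels for the three alternatives are neutral placeholders, not names taken from the Parsoid code.

```python
import re

# The pattern exactly as it appeared in the thread (stars eaten by the
# formatting): the first two groups can only capture a single character.
AS_POSTED = re.compile(r"\[\[([^\[\]])\]\]|\{\{([^\{\}])\}\}|-\{([^\{\}]*)\}-")

# The pattern with a star restored before each of the first two closing
# parens. Adjacent raw strings are concatenated, so each line can carry
# its own comment.
RESTORED = re.compile(
    r"\[\[([^\[\]]*)\]\]"    # [[ ... ]] with no nested square brackets
    r"|\{\{([^\{\}]*)\}\}"   # {{ ... }} with no nested braces
    r"|-\{([^\{\}]*)\}-"     # -{ ... }- with no nested braces
)

LABELS = ("double-bracket", "double-brace", "dash-brace")


def classify(text: str) -> list[tuple[str, str]]:
    """Return (label, inner text) for every match, in order."""
    found = []
    for match in RESTORED.finditer(text):
        # Exactly one of the three groups takes part in any match.
        index = match.lastindex
        found.append((LABELS[index - 1], match.group(index)))
    return found


if __name__ == "__main__":
    print(classify("See [[Main Page]] and {{cite|x}} or -{zh-hans}-"))
    print(AS_POSTED.search("[[Main Page]]"))
```

Broken up this way, the pattern is three near-identical alternatives, which supports the point that most of the visual noise comes from the escaping.
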