by TheDong 2811 days ago
I disagree with that takeaway.

Technically PCRE regexes are powerful and can match anything.

In reality, a complex PCRE regex will almost always be more difficult to maintain than a parser-combinator or hand-rolled parser in a more traditional language.

People saying "Regexes can't match HTML, use an html library" are wrong to say regexes are incapable of it, but they're right to say to use a library meant for the job.

Sure, you can use this regex to match emails (http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html), but using a more normal parsing language than PCRE regex will result in more readable code.

The same is true for almost any regular expression that takes advantage of PCRE features, especially backreferences.

In addition, a regexp will only match HTML correctly if you write a very complex one. With a naive regexp for an HTML tag's contents, you may still match text inside a <script> tag even though that text is not HTML. So now you need to figure out when you're inside a script tag and exclude that, or when you're inside an HTML attribute string, and before you know it you have a 2000-character regexp that no one else will be able to read, all because you didn't want to use an HTML parsing library where getting a tag's value correctly would be a single XPath expression or CSS selector away.
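To make the <script> problem concrete, here is a minimal sketch in Python. It uses the standard library's `html.parser` rather than an XPath or CSS selector library, and the sample page and the `ParagraphText` class are made up for illustration:

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one real paragraph, plus markup-looking text
# inside a <script> element.
PAGE = """<html><body>
<p>real paragraph</p>
<script>var s = "<p>not html</p>";</script>
</body></html>"""

# Naive approach: grab everything between <p> and </p>, wherever it appears.
naive = re.findall(r"<p>(.*?)</p>", PAGE)


class ParagraphText(HTMLParser):
    """Collects the text of <p> elements; script bodies arrive as raw data."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.found[-1] += data


parser = ParagraphText()
parser.feed(PAGE)
parser.close()

print(naive)         # includes the string from inside the script
print(parser.found)  # only the real paragraph
```

The parser knows that a script body is not markup, so it never reports the second `<p>` as a tag. The regex has no such notion unless you bolt it on yourself.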

4 comments

There's no arguing the fact that regexes are a poor fit for HTML, but maybe this is the wrong time to use that ridiculous email regex as an example, since TFA features a highly readable, fully compliant email matching regex as its main example.
It also doesn't point out that matching email addresses in general is a nightmare because the standard is one of those "we'll just allow everything everybody is doing right now" type standards that have a million different little quibbles.

No matter what language or programming style you use it's going to be ugly because it's an ugly problem.

What the language contains as its main example is a PCRE pattern that matches email addresses, and compared to its original EBNF incarnation it is highly unreadable due to all the syntax noise required to graft backtracking support onto traditional regex syntax (not to mention that its performance is almost certainly suboptimal).
The performance might be better in some cases actually.

In the case of Perl, where regexp performance has been heavily optimized, the regexp actually performs better than a more normal parser.

From http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html:

> It provides the same functionality as RFC::RFC822::Address, but uses Perl regular expressions rather that the Parse::RecDescent parser. This means that the module is much faster to load as it does not need to compile the grammar on startup.

Of course, if Perl were a statically compiled language, the grammar could be compiled at build time instead of on startup.

Perl5 is not very meaningful for such performance comparisons, because on the one hand its regex implementation is very optimized, while on the other hand the performance of Perl5 on "normal" procedural code is horrible (e.g. Perl5 is about an order of magnitude slower than CPython on Gabriel's Takeuchi function benchmark).
This result generalises to most interpreted languages though. PHP, Python, JavaScript, etc. all have highly optimised regex engines, and regexes can consequently be a good optimisation technique when using those languages.
Using regex to match email addresses isn't actually a good idea either.

https://blog.onyxbits.de/validating-email-addresses-with-a-r...

> People saying "Regexes can't match HTML, use an html library" are wrong to say regexes are incapable of it, but they're right to say to use a library meant for the job.

Plot twist: the HTML library is built upon regexes (at least in part).

Every parser is partially built upon regexes. You have to go all the way to Haskell, Prolog, or similar languages before you get better options than regexes to build them.

But they are not built solely of regexes. They always add control structures that complement regexes in the places where regexes are weakest.
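A minimal sketch of that division of labour, assuming a toy arithmetic grammar (the grammar and all function names here are made up for illustration): a regex does the flat tokenizing, and ordinary recursive functions handle the nesting that the regex is bad at.

```python
import re

# The regex handles only the flat part: splitting text into tokens.
# Characters it does not recognise are silently skipped in this sketch.
TOKEN = re.compile(r"\d+|[+*()]")


def tokenize(text):
    return TOKEN.findall(text)


def parse_expr(tokens, pos=0):
    """expr := term ('+' term)*"""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        value += rhs
    return value, pos


def parse_term(tokens, pos):
    """term := atom ('*' atom)*"""
    value, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        rhs, pos = parse_atom(tokens, pos + 1)
        value *= rhs
    return value, pos


def parse_atom(tokens, pos):
    """atom := NUMBER | '(' expr ')'  (the recursion handles nesting)"""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected ')'")
        return value, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise ValueError(f"unexpected token {tok!r}")


def evaluate(text):
    tokens = tokenize(text)
    value, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return value


print(evaluate("2 * (3 + (4 * 5))"))
```

Arbitrarily deep parentheses cost nothing extra here, because the nesting lives in the call stack rather than in the pattern.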

> Sure, you can use this regex to match emails

Or you can use comments and named groups: https://stackoverflow.com/a/1917982
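For a rough idea of what comments and named groups buy you, here is a sketch using Python's `re.VERBOSE` flag. The pattern is deliberately simplified and is not RFC-compliant; the name `SIMPLE_ADDRESS` and the character sets are assumptions made for the example, not taken from the linked answer.

```python
import re

# Deliberately simplified: this is NOT an RFC 5322 matcher, only an
# illustration of verbose mode, inline comments and named groups.
SIMPLE_ADDRESS = re.compile(
    r"""
    (?P<local>  [A-Za-z0-9._%+-]+ )   # local part: a narrow character set
    @
    (?P<domain>
        [A-Za-z0-9-]+                 # first label
        (?: \. [A-Za-z0-9-]+ )+       # one or more further dot-separated labels
    )
    """,
    re.VERBOSE,
)

match = SIMPLE_ADDRESS.fullmatch("alice@example.com")
if match:
    print(match.group("local"), match.group("domain"))
```

Whitespace and `#` comments outside character classes are ignored in verbose mode, so the pattern can be laid out like ordinary code.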

Even real regular regexes can be used when nesting is limited, which is true for most real-world HTML, XML, and JSON. Still, you're better off using libraries.
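A sketch of the bounded-nesting idea, using balanced parentheses as a stand-in for tags (the helper `nested_pattern` is hypothetical): the pattern is built by plain string substitution, so it uses no recursion or backreferences and stays a regular expression in the formal sense, at the cost of growing with the allowed depth.

```python
import re


def nested_pattern(depth):
    """Regex source for balanced parentheses nested at most `depth` levels."""
    pattern = r"[^()]*"  # depth 0: no parentheses at all
    for _ in range(depth):
        # Wrap the previous level in one more pair of literal parentheses.
        pattern = rf"[^()]*(?:\({pattern}\)[^()]*)*"
    return pattern


two_levels = re.compile(nested_pattern(2))
print(bool(two_levels.fullmatch("a(b(c)d)e")))    # nesting depth 2
print(bool(two_levels.fullmatch("a(b(c(d))e)")))  # nesting depth 3
```

Each extra level roughly doubles the pattern text, which is one more reason a library is the easier route once the depth is not small and fixed.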