Hacker News new | ask | show | jobs
by rspeer 3957 days ago
What subset of HTML are you talking about here? HTML is not regular. Regexes can't do recursive things, such as matching opening and closing tags, and abusing regexes to sort of match HTML is generally considered a terrible idea.
3 comments

Regexes can't do recursive things

Basic regular expressions cannot, but regexes actually can. PCRE pioneered the technique AFAIK, and it later spread to Perl, Python, Ruby and other runtimes. Perl has this feature called lazy regular subexpressions which can be used to evaluate Perl expressions upon matching a subexpression, thus giving you the ability to recurse.

So, yes, some implementations of regular expressions are Turing complete because they run arbitrary code, but that is rather the opposite of a way to make parsers safer.

On that note, once you can run arbitrary functions on your matches, you could match /.*/ and then the function you run is html5lib.parse. Is that still a regular expression?

There is a conceptual version of HTML that's non-regular, but HTML as used in practice is regular, because the standard allows parsers to impose a maximum nesting depth, and all browsers that I know of do so (e.g. Webkit limits to 512). Every finite language is regular, so HTML with a finite tag-nesting depth is regular. Therefore you could in principle parse the subset of HTML recognized by Webkit using a regular expression.

It's impractical and error-prone to do so, of course, but that's not a computability issue. You could write a compiler, if you really wanted to, that took a context-free grammar plus an integer specifying maximum production depth, and mechanically converted it to an FSM or regex. Therefore I think the article is barking up the wrong tree with computability; the issue isn't what's computable by a finite vs. pushdown automaton, but that parsing nested data structures with regexes is virtually impossible to do correctly.

I think you are technically correct, but your proof contains a non-sequitur: limiting tag-nesting depth doesn't limit the size of a string accepted as HTML, because the textual contents of any tag (or an attribute, or sequenced rather than /nested/ tags) may be of unlimited length.
It's true anyway. The limited depth means it has a finite number of states, regardless of the length, and that's all you need for it to be regular.

But this is a technicality and it should in no way indicate that a regex for HTML would be a good idea.

The reason I think it's useful to note, even though you would not in practice parse it that way, is that in my opinion it better points to why parsing HTML with regexes is a bad idea. It would be possible to do so with a giant regex, but there are two problems: 1) manually writing regexes to parse nested data structures is bug-prone; and 2) finite-automaton size scales very badly with the choice of nesting limit, which is more of a complexity rather than computability issue.

Say you had a recursive data structure with a very small maximum nesting depth, like 5. Should you use a regex then? I would argue still no: there's no computational problem, but writing a correct regex to do so is still bug-prone. At least, writing one manually is. In the case of small finite nesting depths there might occasionally be reasons to mechanically compile something that looks more like EBNF to a DFA or NFA. But something else might well be better. At that point it's just an efficiency question.

Of course using regexes on general-purpose HTML is a bad idea. Using regexes to ensure the sanity of a restricted markdown converter isn't, especially if you want to err on the side of conservatism.