Hacker News new | ask | show | jobs
by mjn 3957 days ago
There is a conceptual version of HTML that's non-regular, but HTML as used in practice is regular, because the standard allows parsers to impose a maximum nesting depth, and all browsers that I know of do so (e.g. Webkit limits to 512). Every finite language is regular, so HTML with a finite tag-nesting depth is regular. Therefore you could in principle parse the subset of HTML recognized by Webkit using a regular expression.

It's impractical and error-prone to do so, of course, but that's not a computability issue. You could write a compiler, if you really wanted to, that took a context-free grammar plus an integer specifying maximum production depth, and mechanically converted it to an FSM or regex. Therefore I think the article is barking up the wrong tree with computability; the issue isn't what's computable by a finite vs. pushdown automaton, but that parsing nested data structures with regexes is virtually impossible to do correctly.

1 comments

I think you are technically correct, but your proof contains a non-sequitur: limiting tag-nesting depth doesn't limit the size of a string accepted as HTML, because the textual contents of any tag (or an attribute, or sequenced rather than /nested/ tags) may be of unlimited length.
It's true anyway. The limited depth means it has a finite number of states, regardless of the length, and that's all you need for it to be regular.

But this is a technicality and it should in no way indicate that a regex for HTML would be a good idea.

The reason I think it's useful to note, even though you would not in practice parse it that way, is that in my opinion it better points to why parsing HTML with regexes is a bad idea. It would be possible to do so with a giant regex, but there are two problems: 1) manually writing regexes to parse nested data structures is bug-prone; and 2) finite-automaton size scales very badly with the choice of nesting limit, which is more of a complexity rather than computability issue.

Say you had a recursive data structure with a very small maximum nesting depth, like 5. Should you use a regex then? I would argue still no: there's no computational problem, but writing a correct regex to do so is still bug-prone. At least, writing one manually is. In the case of small finite nesting depths there might occasionally be reasons to mechanically compile something that looks more like EBNF to a DFA or NFA. But something else might well be better. At that point it's just an efficiency question.