Hacker News new | ask | show | jobs
by gue5t 3957 days ago
I think you are technically correct, but your proof contains a non-sequitur: limiting tag-nesting depth doesn't limit the size of a string accepted as HTML, because the textual contents of any tag (or an attribute, or sequenced rather than /nested/ tags) may be of unlimited length.
1 comments

It's true anyway. The limited depth means it has a finite number of states, regardless of the length, and that's all you need for it to be regular.

But this is a technicality and it should in no way indicate that a regex for HTML would be a good idea.

The reason I think it's useful to note, even though you would not in practice parse it that way, is that in my opinion it better points to why parsing HTML with regexes is a bad idea. It would be possible to do so with a giant regex, but there are two problems: 1) manually writing regexes to parse nested data structures is bug-prone; and 2) finite-automaton size scales very badly with the choice of nesting limit, which is more of a complexity rather than computability issue.

Say you had a recursive data structure with a very small maximum nesting depth, like 5. Should you use a regex then? I would argue still no: there's no computational problem, but writing a correct regex to do so is still bug-prone. At least, writing one manually is. In the case of small finite nesting depths there might occasionally be reasons to mechanically compile something that looks more like EBNF to a DFA or NFA. But something else might well be better. At that point it's just an efficiency question.