by name_censored_ 5458 days ago
>Now I understood the reason _why_ you can't use regular expressions to parse HTML is that HTML is usually not regular. Is this true?

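I believe the reason is that HTML is (roughly) a Type 2, or context-free, language in the Chomsky hierarchy, which takes a push-down automaton to recognize, whereas a regexp in the formal sense describes a Type 3, or regular, language, which a finite state automaton can recognize. To put it simply, parsing HTML needs a stack to keep track of which tags are currently open, and a regexp has nothing like that: it reads the input left to right with a fixed, finite amount of state, with no subroutines or recursion.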
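To make the stack point concrete, here is a rough Python sketch (my own illustration, not anything from the article). The regexp is only used to find individual tags, which is a regular job; matching each closing tag against the most recently opened one is done by an explicit stack. The names `TAG` and `is_balanced` are made up for the example, and it deliberately ignores void elements like `<br>`, comments, and attribute values containing `>`.

```python
import re

# Regular part: recognise a single opening or closing tag.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def is_balanced(markup):
    """Return True if every tag is closed in the reverse order it was opened."""
    stack = []  # names of the tags that are currently open
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)  # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False  # closing tag that does not match the innermost open tag
    return not stack  # anything left on the stack was never closed
```

The stack can grow as deep as the nesting goes, which is exactly the unbounded memory a finite state automaton does not have.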

http://en.wikipedia.org/wiki/Chomsky_hierarchy

I suspect you might be right about him "cheating" by using Perl, but not knowing a lick of Perl, I can't say for sure one way or the other.

Edit: Apparently, back-references mean that regexps as implemented in practice aren't strictly regular. That actually makes more sense now; back-references have never quite meshed with my understanding of regular languages.
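A small Python sketch of the back-reference point (again my own example, with a made-up helper name `same_count`): the pattern below accepts some number of a's, then a b, then exactly the same number of a's. That language is a textbook case of a non-regular language, since a finite state automaton cannot count an unbounded number of a's, yet a single back-reference handles it.

```python
import re

# \1 must repeat exactly what group 1 captured, so both runs of a's
# are forced to have the same length.
PATTERN = re.compile(r"(a*)b\1")


def same_count(text):
    """Return True if text is n a's, one b, then n a's again."""
    return PATTERN.fullmatch(text) is not None
```

So `same_count("aaabaaa")` holds while `same_count("aabaaa")` does not, which no regexp without back-references could decide for arbitrary lengths.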