by name_censored_
5458 days ago
>Now I understood the reason _why_ you can't use regular expressions to parse HTML is that HTML is usually not regular. Is this true?

I believe the reason is that HTML is a Type 2 language in the Chomsky hierarchy (context-free, recognized by a push-down automaton), whereas a regexp in the formal sense describes a Type 3 language (regular, recognized by a finite state automaton). To put it simply, parsing HTML needs a stack to keep track of which elements are currently open, and a regexp isn't powerful enough for that: it reads left to right with a fixed amount of state, with no subroutines or recursion.

http://en.wikipedia.org/wiki/Chomsky_hierarchy

I suspect you might be right about him "cheating" using Perl, but not knowing a lick of Perl, I can't say for sure one way or the other.

Edit: Apparently, back-references mean that regexps as implemented in practice aren't regular. That actually makes more sense now; back-references never quite meshed with my understanding of regular languages.
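To make the stack point concrete, here is a minimal Python sketch (I don't know Perl, so this is not the code being discussed). It assumes a toy markup with only lowercase tags like `<b>` and `</b>` and no attributes, which is far simpler than real HTML. The names `TAG`, `is_well_nested` and `DOUBLED` are made up for illustration. The regexp only finds individual tags; the explicit stack does the part a finite state automaton can't. The last pattern shows a back-reference matching "some string repeated twice", which is not a regular language.

```python
import re

# Toy tokenizer (assumption: tags are just <name> or </name>, no attributes).
TAG = re.compile(r"<(/?)([a-z]+)>")

def is_well_nested(markup: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []  # the "frame stack": names of the currently open tags
    for match in TAG.finditer(markup):
        closing, name = match.groups()
        if not closing:
            stack.append(name)          # opening tag: push
        elif not stack or stack.pop() != name:
            return False                # closing tag that doesn't match the top
    return not stack                    # leftover open tags mean failure

# A back-reference: matches a string of a's and b's written twice in a row.
# The set of such strings is not regular, so this goes beyond Type 3.
DOUBLED = re.compile(r"^([ab]+)\1$")

if __name__ == "__main__":
    print(is_well_nested("<div><b></b></div>"))
    print(is_well_nested("<div><b></div></b>"))
    print(bool(DOUBLED.match("abab")), bool(DOUBLED.match("abba")))
```

The nesting depth is unbounded, so no fixed number of states can remember which tags are still open; that is what the stack is for.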