Hacker News
by aranchelk 1722 days ago
Caveats: I know nothing of Chomsky grammars, and I have only a passing familiarity with Cthulhu, but IMO the real crux of the issue with parsing HTML with regex (beyond all the “it’s hard”, “the spec is more complicated than you think”, “regex is impossible to read”, etc.) is that HTML is a recursive data structure, e.g. you can have a div inside a div inside a div, ad infinitum. Regex, AFAIK, doesn’t allow you to describe recursion, so you’re left with regex plus supporting code, and you’ll then have an impedance mismatch between the two.
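To make the nesting problem concrete, here is a minimal sketch using Python's built-in `re` module, which has no recursive patterns (some other engines, such as PCRE, do offer recursion extensions). The sample markup and the helper name `max_div_depth` are made up for illustration.

```python
import re

# Hypothetical sample: one div nested inside another.
html = "<div>a<div>b</div>c</div>"

# A non-greedy match stops at the first closing tag, so the outer div is cut short.
lazy = re.search(r"<div>(.*?)</div>", html).group(1)

# A greedy match runs to the last closing tag. It is right here only because
# the sample has a single top-level div; two sibling divs would be merged.
greedy = re.search(r"<div>(.*)</div>", html).group(1)


def max_div_depth(text):
    """Track nesting with a counter: the 'supporting code' the regex needs."""
    depth = deepest = 0
    for tag in re.finditer(r"<(/?)div>", text):
        if tag.group(1):  # closing tag
            depth -= 1
        else:  # opening tag
            depth += 1
            deepest = max(deepest, depth)
    return deepest
```

The regex only finds the tags; the counter around it is what actually understands the nesting, which is where the mismatch between the two shows up.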

URLs are not recursive structures, so I’d say the single hardest feature of HTML is not present.
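As a rough illustration of that flatness, the pattern given in RFC 3986, Appendix B splits a URL into its parts in a single pass, with no nesting to track. The wrapper `split_url` and the example URL are assumptions for the sketch, and the pattern only splits components; it does not validate them.

```python
import re

# Pattern from RFC 3986, Appendix B. Each component appears at most once,
# in a fixed order, so one flat expression covers the whole structure.
URL_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_url(url):
    """Return the five URL components; a missing optional part is None."""
    m = URL_RE.match(url)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_url("https://example.com/a/b?x=1#top")
```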

1 comment

The times I had to use it on HTML, I think I combined XPath with regex to close the mismatch.
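A combination like that might look roughly like the sketch below: an XPath-style query handles the nesting, and regex is left with the flat text inside each node. This is only a guess at the general shape, not the setup actually used. It relies on the limited XPath subset in Python's `xml.etree.ElementTree`, assumes well-formed, XML-compatible markup (real-world HTML usually needs a more tolerant parser), and the sample document is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample with <p> elements at different depths.
doc = ET.fromstring(
    "<div><div><p>Order 123</p><div><p>Order 4567</p></div></div>"
    "<p>No number</p></div>"
)

# The path query deals with the recursion: every <p> at any depth,
# returned in document order.
paragraphs = doc.findall(".//p")

# The regex only ever sees the flat text of a single node.
order_ids = []
for p in paragraphs:
    match = re.search(r"Order (\d+)", p.text or "")
    if match:
        order_ids.append(match.group(1))
```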