by azalemeth 1722 days ago
Honest question: there is a famous and very funny Stack Overflow answer on the topic of parsing HTML with a regex [1] that states that the problem is in general impossible, and that if you find yourself doing this, something has gone wrong and you should re-evaluate your life choices / pray to Cthulhu.

So, does this apply to URLs? The fact that these regexes are....so huge...makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a Regex is sensible? What do the actual browsers do?

[1] https://stackoverflow.com/questions/1732348/regex-match-open...

3 comments

Caveats: I know nothing of Chomsky grammars, and I have only a passing familiarity with Cthulhu, but IMO the real crux of the issue with parsing HTML with regex (beyond all the “it’s hard”, “the spec is more complicated than you think”, “regex is impossible to read”, etc.) is that HTML is a recursive data structure, e.g. you can have a div, inside a div, inside a div, ad infinitum. Regex, AFAIK, doesn’t allow you to describe recursion, so you’re left with regex plus supporting code. You’ll then have an impedance mismatch between the two.

URLs are not recursive structures, so I’d say the single hardest feature of HTML is not present.
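
To make the "regex plus supporting code" point concrete, here is a minimal Python sketch. The regex only recognizes individual div tags; the nesting is tracked by an ordinary counter in the surrounding code. The function name `max_div_depth` is made up for illustration, and the pattern ignores comments, scripts, and attribute values containing ">", so treat it as a toy rather than an HTML parser.

```python
import re

# Matches a single opening or closing div tag; group 1 is "/" for a closing tag.
TAG_RE = re.compile(r"<(/?)div\b[^>]*>")

def max_div_depth(html):
    """Return the deepest level of <div> nesting, or raise on unbalanced tags."""
    depth = 0
    deepest = 0
    for m in TAG_RE.finditer(html):
        if m.group(1):  # closing tag
            depth -= 1
            if depth < 0:
                raise ValueError("closing </div> without a matching opening tag")
        else:  # opening tag
            depth += 1
            deepest = max(deepest, depth)
    if depth != 0:
        raise ValueError("unclosed <div>")
    return deepest
```

The counter is exactly the part a plain regular expression has no way to express for unbounded depth.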

The times I had to use it on HTML, I think I combined XPath with regex to close the mismatch.

I haven't looked at the BNF(s) for URIs lately, but I don't recall there being any recursion, so I wouldn't be surprised if the language were regular.

There was a Perl program that would take something like a BNF and barf out a gigantic regex (maybe with some maximum depth).
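
For what it's worth, RFC 3986 (Appendix B) itself gives a short regular expression for breaking a URI reference into its components. It splits rather than validates, but it does suggest the top-level structure is regex-friendly. A small Python sketch using that pattern (the helper name `split_uri` is mine, not from the RFC):

```python
import re

# Pattern taken from RFC 3986, Appendix B. It splits a URI reference into
# components; it does not check that each component is well-formed.
URI_SPLIT = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return the five generic URI components (missing ones are None)."""
    m = URI_SPLIT.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
```

The group numbers (2, 4, 5, 7, 9) are the ones the RFC assigns to scheme, authority, path, query, and fragment.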

>So, does this apply to URLs? The fact that these regexes are....so huge...makes me think that something is fundamentally wrong

Yes, if your regex is above {.../50/100/...} characters, then write a parser.

I struggle to understand why people write those crazy regexes for emails, URLs, and HTML when probably every popular technology has battle-tested parsers for those things.
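
As one illustration of an existing parser, Python's standard library ships `urllib.parse.urlsplit`. A minimal sketch (the example URL is made up):

```python
from urllib.parse import urlsplit

# Split a URL into named components using the standard library parser.
parts = urlsplit("https://user@example.com:8443/a/b?x=1#frag")

scheme = parts.scheme      # "https"
host = parts.hostname      # "example.com"
port = parts.port          # 8443 (already converted to int)
path = parts.path          # "/a/b"
query = parts.query        # "x=1"
fragment = parts.fragment  # "frag"
```

Note that `urlsplit` is lenient about what it accepts, so any policy checks (allowed schemes, allowed hosts) still have to be done on the parsed components.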

On top of this, the error messages with a regex will be very one-dimensional.

As an example, http://localhost/ is technically a valid URL, which he wants to block. Should this error say "misformatted URL" like all the others?

Using regex to cover all such cases is really the wrong tool for the job.
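
A rough Python sketch of what parse-then-check could look like, with a separate message per failure. The function name `check_url`, the blocked-host set, and the allowed schemes are all assumptions for illustration, not anything from the article:

```python
from urllib.parse import urlsplit

# Hypothetical policy: hosts we refuse even though the URL itself is valid.
BLOCKED_HOSTS = {"localhost", "127.0.0.1", "::1"}

def check_url(url):
    """Return None if the URL is acceptable, otherwise a specific error message."""
    try:
        parts = urlsplit(url)
    except ValueError as exc:  # e.g. an unterminated IPv6 literal
        return f"malformed URL: {exc}"
    if parts.scheme not in ("http", "https"):
        return f"unsupported scheme: {parts.scheme!r}"
    if not parts.hostname:
        return "missing host"
    if parts.hostname in BLOCKED_HOSTS:
        return f"host not allowed: {parts.hostname}"
    return None
```

Here http://localhost/ is rejected with "host not allowed: localhost" rather than being lumped in with genuinely malformed input.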

Sometimes you're given an arbitrary bag of bytes with best-effort well-formed data. Regexes are gross but quite good for those cases where you need to try to rip out some bits from the data abyss.
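
For that "rip out some bits" case, a deliberately loose pattern is often enough. A minimal Python sketch over raw bytes (the pattern and the name `extract_urls` are just one possible choice; it will happily return things that are not valid URLs):

```python
import re

# Loose on purpose: "http://" or "https://" followed by a run of bytes that are
# not whitespace, angle brackets, or quotes (\x22 is '"', \x27 is "'").
URL_RE = re.compile(rb"https?://[^\s<>\x22\x27]+")

def extract_urls(blob):
    """Pull URL-looking substrings out of arbitrary bytes."""
    return [m.decode("ascii", errors="replace") for m in URL_RE.findall(blob)]

blob = b'\x00\xffjunk <a href="http://example.com/a?b=1">x</a> see https://example.org/path\n\x01'
urls = extract_urls(blob)
```

Anything this finds can then be handed to a real parser for proper checking.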