by jerf 3619 days ago
"On the other hand, using regexp to parse the URL when it's such an obviously security critical code path... just, why?!"

Why not? URIs are at least able to be tokenized perfectly well by a regular expression. You have to do it right, but there's little guarantee that your non-regexp code will do it right either. I glanced at that regexp and immediately recognized several potential problems with it... will I be able to do that with your non-regexp code?

To concretize the "several potential problems":

1. You generally don't want to parse arbitrary protocols, you should do something like (http|https|file) or whatever set of protocols you are ready to receive. Usually you're better off treating anything else as "not a URL", but consult your local security context for details.

2. Failing that, you want at the very least .⁎? to stop matching at the first :, or if your engine doesn't have that, the protocol ought to be matched with something much tighter like [a-z]+. And I do mean + and not ⁎, because you probably don't mean to support an empty protocol before the colon. (You may mean to permit URLs with no protocol, but that's (.⁎?:)? .)

3. Domains should be parsed more tightly than "not a slash".

4. Also, I have no idea what the @ was doing there. Perhaps it was trying to be $; URL parsing should always end with the "end of string" matcher to avoid problems similar to this. It should also start with the start-of-string matcher, which this one doesn't, for similar reasons.

5. Bonus critique, anything using regular expressions to URL-encode or decode is very suspicious; strongly prefer built-in functions that do this.
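
To make points 1 through 4 concrete, here is a rough Python sketch of what a tighter tokenizer might look like. It is not the regexp from the article and not a complete URL grammar. The names ALLOWED_SCHEMES, URL_RE and tokenize_url are made up for illustration, and the host pattern (dot-separated ASCII labels, optional port) is a deliberately conservative assumption that rejects userinfo, IPv6 literals and internationalized names.

```python
import re

# Hypothetical allowlist (point 1): only the schemes you are ready to receive.
ALLOWED_SCHEMES = ("http", "https")

# One DNS-style label: letters, digits, inner hyphens (assumption, point 3).
_LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

URL_RE = re.compile(
    r"\A"                                                   # start of string (point 4)
    r"(?P<scheme>" + "|".join(ALLOWED_SCHEMES) + r")://"    # allowlisted scheme only
    r"(?P<host>" + _LABEL + r"(?:\." + _LABEL + r")*)"      # tighter than "not a slash"
    r"(?::(?P<port>[0-9]{1,5}))?"                           # optional numeric port
    r"(?P<rest>/\S*)?"                                      # path, query, fragment, untokenized
    r"\Z",                                                  # true end of string (point 4)
    re.IGNORECASE,
)


def tokenize_url(candidate):
    """Return a dict of named parts, or None if the input is "not a URL"."""
    match = URL_RE.match(candidate)
    return match.groupdict() if match else None
```

Anything outside the allowlist, anything with an @ in the authority, and anything with trailing junk after the URL comes back as None rather than as a half-parsed result.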

I literally saw all this faster than I could type it; does your non-regular-expression based code have this property?

Regular expressions aren't bad. They're hard to write properly, but still probably easier to write properly than anything else. It turns out the underlying problem is fundamentally hard.

(Had to use an alternate asterisk to get the regular expressions to display correctly with HN trying to format them.)

4 comments

Heh.

https://mathiasbynens.be/demo/url-regex

https://lostechies.com/chadmyers/2010/11/20/parsing-a-url-wi...

https://stackoverflow.com/questions/27745/getting-parts-of-a...

What do all these have in common? They all demonstrate that it is hard to write a regex that parses URLs.

Regexes hide programming mistakes: not only do they become harder for humans to parse as they get more complicated, but there is also no simple way to handle potential security vulnerabilities in-line with the code. They also hide the [at least] two considerations of secure programming: designing a secure function, and handling the function securely.

It is hard to write code that parses URLs, period.

You can't look at regexes in isolation, see that a task is hard, then declare them unfit. You have to consider them as one of the many choices and analyze the costs and benefits of the whole suite of options.

I guarantee you that anyone who has said "Oh, gosh, this is hard, I'll just start using indexOf and substring operations" has written code that is just as broken, only in ways much harder to tell.

Which is probably why everyone here thinks it's better to not use regex. You didn't write better code... you wrote code that hid its brokenness better. That's not a good thing!

Again, my real point here is not "regexes are awesome in every way"... my point is that I literally glanced at that code and saw several ways in which it was wrong. Does your alternative have that property?

Also, some of the difficulties of regexes are accidental, not essential. Take something like the recent Perl 6 efforts for parsing and you're far better off in every way using that stuff than trying to bash together string-manipulation-based parsing, or whatever other alternatives you may be thinking of. The Perl 6 constructs will be more readable and more maintainable. (Perl 6 is crazy in a lot of ways but the parsing support is best-of-breed.)

Regular expressions describe finite state machines, and in programming there's nothing simpler than finite state machines / finite automata. Your handling of vulnerabilities inline is anything but simple. This is CS 101.

The handling of security is anything but simple. Hence, a simple solution is anything but secure. This is Security 101.

I can't even parse that. The first sentence is false. The second sentence wouldn't follow from it if it were true, but then a false sentence can imply anything. And I've literally taken Security 101 in college.

You might also want to check the definition of "simple". Probably doesn't mean what you think it means.

> using regexp to parse the URL

A while back I whipped this up: https://gist.github.com/pmarreck/2956396

which seemed to work well (although it was a bit slower; I probably didn't know about exponential backtracking at the time, so that could probably be revisited). It won't gather multiple name/value pairs, but it will cut out and name basically every other part of the URL.

Wait what?

"Why not? URIs are at least able to be tokenized perfectly well by a regular expression."

"5. Bonus critique, anything using regular expressions to URL-encode or decode is very suspicious; strongly prefer built-in functions that do this."

Encoding or decoding is not tokenization.

And the problem is probably more accurately stated as "be suspicious of any function implementing encoding or decoding" rather than focusing on the regex part. Use the correct standard function. Don't bash something together yourself. They're actually pretty easy functions to write if you know what you're doing, but it's even easier to use some tested already-existing function. In fact, it's so easy that the fact that you see someone bashing together a URL encoding or decoding function almost certainly proves that they don't know what they are doing, which in turn means the URL encoding or decoding function was written by someone who doesn't know what they are doing. Unsurprisingly, these are, well, to quote myself, "suspicious".
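
As an illustration of "use the correct standard function", here is a minimal Python sketch using the standard library's urllib.parse.quote and urllib.parse.unquote. The sample string is made up, and passing safe="" (so that "/" is escaped too) is a choice for this example, not a universal rule.

```python
from urllib.parse import quote, unquote

raw = "a b&c=d/é"

# Percent-encode everything outside the unreserved set; safe="" means "/" is
# escaped as well, which suits a single path segment or query value.
encoded = quote(raw, safe="")

# Decoding reverses it, interpreting the escaped bytes as UTF-8 by default.
decoded = unquote(encoded)

print(encoded)  # a%20b%26c%3Dd%2F%C3%A9
print(decoded == raw)  # True
```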

Yes, that logic applies to URL parsing as well! Unfortunately, browsers make URL parsing extra hard, which is really stupid, so you end up with more crap in JavaScript than anywhere else. Even then you ought to prefer someone else's tested solution over just smashing out a regular expression; however, it is not a knock on the tested solution if it is a regular expression-based solution.
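
For comparison, here is what "someone else's tested solution" might look like outside the browser: a small Python sketch built on the standard library's urllib.parse.urlsplit. The helper name split_if_allowed and the allowlist are hypothetical, and this only splits the URL into parts; whether the result is safe to use still depends on your own security context.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; anything else is treated as "not a URL".
ALLOWED_SCHEMES = {"http", "https"}


def split_if_allowed(candidate):
    """Split with the standard parser, then apply a scheme allowlist."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit raises ValueError on some malformed input,
        # such as an unclosed IPv6 bracket.
        return None
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return None
    return parts
```

A call such as split_if_allowed("https://user@example.com:8443/a?b=c#d") returns a result whose hostname, port, path, query and fragment attributes can be read directly, so the @ handling is the library's problem rather than yours.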

> Why not?

This article is a perfect example of why not.