by 0xbadcafebee 3618 days ago
Heh.

https://mathiasbynens.be/demo/url-regex

https://lostechies.com/chadmyers/2010/11/20/parsing-a-url-wi...

https://stackoverflow.com/questions/27745/getting-parts-of-a...

What do all these have in common? They all demonstrate that it is hard to write a regex that parses URLs.

Regexes hide programming mistakes: they become harder for humans to parse as they get more complicated, and there is no simple way to put code that handles potential security vulnerabilities in-line with the pattern. They also hide the [at least] two considerations of secure programming: designing a secure function, and handling the function securely.
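A minimal sketch of the kind of mistake a short pattern can hide, assuming a made-up allowlist check for `example.com` (the patterns and function names are hypothetical, not taken from any of the links above). The loose version looks reasonable at a glance but has an unescaped dot and no anchoring at the end; the stricter one is still only an illustration, not a complete validator.

```python
import re

# Hypothetical allowlist check: "only let through URLs on example.com".
# Two mistakes hide here: the dot is unescaped, and nothing pins down
# what may follow the host name.
LOOSE = re.compile(r"https?://example.com")

# A tighter sketch: escaped dot, optional port, and the host must be
# followed by a path, query, fragment, or the end of the string.
STRICT = re.compile(r"https?://example\.com(?::\d+)?(?:[/?#].*)?")


def loose_allows(url: str) -> bool:
    # re.match only anchors at the start, so any suffix is accepted.
    return LOOSE.match(url) is not None


def strict_allows(url: str) -> bool:
    # fullmatch requires the whole string to fit the pattern.
    return STRICT.fullmatch(url) is not None


if __name__ == "__main__":
    for candidate in (
        "https://example.com/path",
        "https://example.com.evil.net/",
        "https://exampleXcom/",
        "https://example.com@evil.net/",
    ):
        print(candidate, loose_allows(candidate), strict_allows(candidate))
```

Both mistakes are in the design of the check (the first consideration above); whether the caller then uses the result safely is a separate question the pattern says nothing about.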

2 comments

It is hard to write code that parses URLs, period.

You can't look at regexes in isolation, see that a task is hard, then declare them unfit. You have to consider them as one of the many choices and analyze the cost/benefits of the whole suite of options.

I guarantee you that anyone who has said "Oh, gosh, this is hard, I'll just start using indexOf and substring operations" has written code that is just as broken, only in ways much harder to tell.
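To make that concrete, here is a sketch of the indexOf-and-substring style in Python (`str.find` plus slicing). The function `naive_host` is hypothetical, written only to illustrate the point; `urllib.parse.urlsplit` from the standard library is shown for comparison.

```python
from urllib.parse import urlsplit


def naive_host(url: str) -> str:
    """Hypothetical hand-rolled host extraction using find() and slicing."""
    start = url.find("://") + 3
    end = url.find("/", start)
    # Hidden bug: when there is no "/" after the host, find() returns -1
    # and the slice silently drops the last character.
    return url[start:end]


if __name__ == "__main__":
    # Looks right on the case the author probably tested.
    print(naive_host("http://example.com/index.html"))
    # No path: the trailing "m" is cut off without any error.
    print(naive_host("http://example.com"))
    # Userinfo and port are returned as part of the "host".
    print(naive_host("http://user:pw@example.com:8080/x"))
    # The standard library separates host, userinfo, and port.
    print(urlsplit("http://user:pw@example.com:8080/x").hostname)
```

Nothing in `naive_host` looks wrong at a glance, which is the problem: the failures only show up on inputs nobody thought to try.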

Which is probably why everyone here thinks it's better to not use regex. You didn't write better code... you wrote code that hid its brokenness better. That's not a good thing!

Again, my real point here is not "regexes are awesome in every way"... my point is that I literally glanced at that code and saw several ways in which it was wrong. Does your alternative have that property?

Also, some of the difficulties of regexes are accidental, not essential. Take something like the recent Perl 6 efforts for parsing and you're far better off in every way using that stuff than trying to bash together string-manipulation-based parsing, or whatever other alternatives you may be thinking of. The Perl 6 constructs will be more readable and more maintainable. (Perl 6 is crazy in a lot of ways but the parsing support is best-of-breed.)

Regular expressions describe finite state machines and in programming there's nothing simpler than finite state machines / finite automata. Your handling of vulnerabilities inline is anything but simple. This is CS 101.
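As a small illustration of the regex/state-machine correspondence, here is a sketch that recognizes a URL scheme followed by a colon (a letter, then letters, digits, `+`, `.`, or `-`) both ways. The state names and function names are made up for the example, and the pattern deliberately uses only regular features; Python's `re` engine itself is a backtracking matcher that also supports non-regular extensions.

```python
import re
import string

# The regex: one letter, then any number of scheme characters, then ":".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")

LETTERS = set(string.ascii_letters)
SCHEME_CHARS = LETTERS | set(string.digits) | set("+.-")


def regex_matches_scheme(text: str) -> bool:
    return SCHEME_RE.fullmatch(text) is not None


def dfa_matches_scheme(text: str) -> bool:
    """Hand-written state machine for the same language as SCHEME_RE."""
    state = "start"
    for ch in text:
        if state == "start":
            # The first character must be a letter.
            state = "body" if ch in LETTERS else "reject"
        elif state == "body":
            if ch == ":":
                state = "accept"
            elif ch not in SCHEME_CHARS:
                state = "reject"
        else:
            # Already accepted: nothing may follow the colon.
            state = "reject"
        if state == "reject":
            return False
    return state == "accept"


if __name__ == "__main__":
    for sample in ("http:", "a+b.c-d:", "1http:", "http", "http::", ""):
        print(repr(sample), regex_matches_scheme(sample), dfa_matches_scheme(sample))
```

The two functions agree on every sample; the regex is just the compact notation for the three-state machine.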
The handling of security is anything but simple. Hence, a simple solution is anything but secure. This is Security 101.

I can't even parse that. First sentence is false. Second sentence wouldn't follow from it if it were true, but then a false sentence can imply anything. And I've literally taken Security 101 in college.

You might also want to check the definition of "simple". Probably doesn't mean what you think it means.