by stevek 5622 days ago
The single worst mistake people make when writing regular expressions is to use .* x when they should use [^x]* x (the common case of looking for some kind of terminator 'x').
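To make the difference concrete, here is a minimal sketch in Python's `re` module. The sample string and the choice of ';' as the terminator are assumptions made for illustration, not something from the case described above.

```python
import re

# Hypothetical input; ';' stands in for the terminator 'x'.
text = "key=abc; other=def; tail"

# Greedy .* first runs to the end of the string, then backs up
# until a ';' fits, so it lands on the LAST terminator.
greedy = re.match(r".*;", text)

# The negated class cannot cross a ';', so it stops at the FIRST one.
negated = re.match(r"[^;]*;", text)

print(greedy.group(0))   # key=abc; other=def;
print(negated.group(0))  # key=abc;
```

Besides the speed question, the negated class says what is usually meant: "everything up to the next terminator".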

The worst case I ever saw had many .* running over some C++ source, and it took several minutes per file, presumably because the engine was trying every combination of backtracking. With the negated character class [^x] it took under 0.1 seconds.
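A rough way to see this effect on a backtracking engine such as Python's `re` is sketched below. The patterns and the input are made up for the demonstration (this is not the original C++ scanning job), and the absolute timings will vary by machine and engine.

```python
import re
import time

# Hypothetical input: 40 ';'-terminated chunks and no trailing '!',
# so neither pattern can match and the engine must exhaust its options.
text = "a;" * 40

# Four "anything up to a ';'" steps, then a '!' that never appears.
dot_star = re.compile(r"^.*;.*;.*;.*;.*!$")
negated = re.compile(r"^[^;]*;[^;]*;[^;]*;[^;]*;[^;]*!$")

def timed_match(pattern, s):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(s)
    return result, time.perf_counter() - start

slow_result, slow_seconds = timed_match(dot_star, text)
fast_result, fast_seconds = timed_match(negated, text)

print(f".*   version: {slow_seconds:.6f} s, match={slow_result}")
print(f"[^;] version: {fast_seconds:.6f} s, match={fast_result}")
```

With .* each step may stop at any later ';', so the engine tries every way of choosing four of the 40 semicolons before giving up. With [^;]* each step can only stop at the very next ';', so failure is detected almost immediately. Note that the two patterns are not strictly equivalent: the negated form requires exactly four ';' before the '!'.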

Edit: I see his twitter feed has .* as the icon. Ha!

Edit2: My * characters are getting eaten by the formatter. There should be no space after them.

2 comments

This would be highly dependent on the implementation of the regex engine itself, would it not?
Right. The size of the difference will depend on implementation details (e.g. whether the engine uses backtracking), but the form the GP proposes will never be slower. And arguably it captures the intent of the RE better.
What about .*? instead?

This makes the match non-greedy.

But it is only supported in some regex implementations (I think it started in Perl?). Support may be more widespread in modern implementations. I think the sed commonly distributed with Linux distros around 2000 didn't support it; I have a vague recollection of spending a frustrating afternoon getting sed to imitate what my Perl regexes did.
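For engines that do support it (Python's `re` is used here as an example), a small sketch shows that non-greedy .*? and the negated class agree in the simple case but are not interchangeable. The input string is a made-up example.

```python
import re

# Hypothetical input with two ';' terminators.
text = "a;b;end"

# Simple case: both forms stop at the first ';'.
lazy_simple = re.match(r".*?;", text)
negated_simple = re.match(r"[^;]*;", text)
print(lazy_simple.group(0), negated_simple.group(0))  # a; a;

# When more pattern follows, lazy .*? may still grow past a ';'
# to make the rest of the pattern succeed.
lazy = re.match(r".*?;end", text)
print(lazy.group(0))  # a;b;end

# The negated class can never cross a ';', so this fails at position 0.
negated = re.match(r"[^;]*;end", text)
print(negated)  # None
```

So .*? changes which match is preferred, while [^;]* changes which matches are possible at all; the lazy form can still backtrack across terminators.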