by kragen
5410 days ago
I feel like it's more error-prone. This regexp lexer took me several minutes to get right, and I'm still not totally sure it's bug-free:
>>> import re
>>> replace = lambda text, env: ''.join(env[item[1:]] if item.startswith('$') else item[1:] if item.startswith('\\') else item for item in re.findall(r'[^\\$]+|\\.|\$\w+|\\$', text))
>>> print replace(r'This $line has \stuff \\in it that costs \$50 and some $variables.', {'line': 'LINE', 'variables': 'apples'})
This LINE has stuff \in it that costs $50 and some apples.
If I were to write out an explicit loop over the characters of the string, I would be a lot more sure that I wasn't accidentally dropping characters due to an inadvertent failure to make the regexp exhaustive (I originally forgot the \\$ case! Although that reduces to the empty string anyway), and I wouldn't have to forget and rediscover which lexical category each token belonged to.
And, although it's not present in this case or in all regexp engines, it's a lot easier to accidentally write an exponential-time algorithm in a regexp than in a nested loop. And my experience has been that it's harder to debug it, too.
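For comparison, here is a rough sketch of what the explicit loop might look like. It is written in Python 3, the name replace_loop is made up for illustration, and it is not the commenter's code. One visible difference: the findall version silently skips a $ that is not followed by a word character, because no alternative matches there. The loop has to decide what to do with it, and this sketch assumes the $ should be kept literally.

```python
def replace_loop(text, env):
    """Expand $name from env and strip backslash escapes, one character at a time."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == '\\':
            # Escape: emit the next character literally.
            # A trailing backslash emits nothing, like the \\$ case in the regexp.
            if i + 1 < n:
                out.append(text[i + 1])
            i += 2
        elif ch == '$':
            # Variable: scan the run of word characters after the dollar sign.
            j = i + 1
            while j < n and (text[j].isalnum() or text[j] == '_'):
                j += 1
            if j == i + 1:
                # Assumption: a lone $ is kept as-is instead of being dropped.
                out.append('$')
            else:
                out.append(env[text[i + 1:j]])
            i = j
        else:
            # Ordinary character: copy it through unchanged.
            out.append(ch)
            i += 1
    return ''.join(out)


if __name__ == '__main__':
    print(replace_loop(
        r'This $line has \stuff \\in it that costs \$50 and some $variables.',
        {'line': 'LINE', 'variables': 'apples'}))
```

Every character goes through exactly one branch, so nothing can fall through the cracks unnoticed, at the cost of roughly twenty lines instead of one.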
|
"If I were to write out an explicit loop over the characters of the string, I would be a lot more sure that I wasn't accidentally dropping characters due to an inadvertent failure to make the regexp exhaustive"
This is why regex comments exist. For any non-trivial regex (more than two or three characters), you should break down and document your regex. Otherwise, it's worse than a 1000-character-long Perl one-liner.
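As a sketch of what that could look like for the lexer quoted above, here is the same pattern written with Python's re.VERBOSE flag, which ignores whitespace and allows # comments inside the pattern. The names TOKEN and replace_verbose are made up for illustration; the pattern itself is unchanged from the parent comment.

```python
import re

# Same four alternatives as the one-line version, one per line with a comment.
TOKEN = re.compile(r"""
    [^\\$]+     # a run of ordinary text (no backslash, no dollar sign)
  | \\.         # a backslash escape: backslash plus any one character
  | \$\w+       # a variable reference such as $line
  | \\$         # a lone backslash at the end of the input
""", re.VERBOSE)


def replace_verbose(text, env):
    """Expand $name from env and strip backslash escapes using TOKEN."""
    parts = []
    for item in TOKEN.findall(text):
        if item.startswith('$'):
            parts.append(env[item[1:]])   # variable: look up the name after the $
        elif item.startswith('\\'):
            parts.append(item[1:])        # escape: drop the backslash
        else:
            parts.append(item)            # ordinary text
    return ''.join(parts)


if __name__ == '__main__':
    print(replace_verbose(
        r'This $line has \stuff \\in it that costs \$50 and some $variables.',
        {'line': 'LINE', 'variables': 'apples'}))
```

The comments document each lexical category, though they do not by themselves show whether the alternatives cover every possible input.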
"it's a lot easier to accidentally write an exponential-time algorithm in a regexp than in a nested loop"
So true. I did not realize it was possible until I made that mistake. Debugging these cases is extremely difficult. Given two strings that look similar, the same regex can go crazy on one of them and not the other. But this has happened to me only once in the past three years (that is, since I started relying heavily on regular expressions for a project).
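A small illustration of the effect, assuming a backtracking engine such as Python's re module. The pattern (a+)+$ is a standard textbook example, not the regex from the project mentioned above, and time_match is a made-up helper. A string of only a's matches quickly, while the same string with a single b appended forces the engine to try a number of ways of splitting the a's that roughly doubles with each extra character.

```python
import re
import time

# Nested quantifiers: the classic shape that invites catastrophic backtracking.
PATTERN = re.compile(r'(a+)+$')


def time_match(text):
    """Return (matched, seconds) for matching PATTERN at the start of text."""
    start = time.perf_counter()
    matched = PATTERN.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == '__main__':
    # Sizes kept small on purpose; each extra 'a' roughly doubles the bad case.
    for n in (12, 16, 20):
        good = time_match('a' * n)        # matches
        bad = time_match('a' * n + 'b')   # fails, only after heavy backtracking
        print(n, good, bad)
```

The two inputs differ by one trailing character, which is what makes these cases so hard to spot from the data alone.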