| HN Mirror



	by apag 5457 days ago
	Then again, reading said book made me believe they’re fairly easy to write well. You need to keep in mind what a quantifier really does (“this will gobble up the whole string and then yield bits until the pattern matches”), but in the end I find it not fundamentally any more taxing than having in my head a rough idea of the behaviour of a few nested loops or recursions.
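To make the "gobble up the whole string and then yield bits" description concrete, here is a minimal sketch using Python's `re` module. The sample string and the two patterns are illustrative choices, not something from the book being discussed.

```python
import re

text = "abc123"

# Greedy: .* first takes the whole string, then gives characters back one
# at a time until \d+ can match, so \d+ is left with only the last digit.
greedy = re.match(r"(.*)(\d+)", text).groups()

# Lazy: .*? takes as little as possible and grows only when the rest of
# the pattern cannot match yet.
lazy = re.match(r"(.*?)(\d+)", text).groups()

print(greedy)  # ('abc12', '3')
print(lazy)    # ('abc', '123')
```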


kragen 5457 days ago

I feel like it's more error-prone. This regexp lexer took me several minutes to get right, and I'm still not totally sure it's bug-free:

    >>> import re
    >>> replace = lambda text, env: ''.join(env[item[1:]] if item.startswith('$') else item[1:] if item.startswith('\\') else item for item in re.findall(r'[^\\$]+|\\.|\$\w+|\\$', text))
    >>> print(replace(r'This $line has \stuff \\in it that costs \$50 and some $variables.', {'line': 'LINE', 'variables': 'apples'}))
    This LINE has stuff \in it that costs $50 and some apples.

If I were to write out an explicit loop over the characters of the string, I would be a lot more sure that I wasn't accidentally dropping characters due to an inadvertent failure to make the regexp exhaustive (I originally forgot the \\$ case! Although that reduces to the empty string anyway) and I wouldn't have to forget and rediscover which lexical category each token belonged to.
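For comparison, the explicit loop described above might look roughly like the sketch below. The name `replace_loop` is hypothetical, and the edge-case choices are assumptions: a bare `$` raises an error, and word characters are approximated with `isalnum()` plus underscore rather than the exact `\w` class.

```python
def replace_loop(text, env):
    """Expand $name from env and unescape backslash escapes, char by char."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "\\":
            # Escape: emit the next character literally. A trailing
            # backslash has nothing to escape and contributes nothing.
            if i + 1 < n:
                out.append(text[i + 1])
            i += 2
        elif ch == "$":
            # Variable: consume the longest run of word characters.
            j = i + 1
            while j < n and (text[j].isalnum() or text[j] == "_"):
                j += 1
            if j == i + 1:
                # Assumption: a "$" with no name after it is an error.
                raise ValueError("bare '$' at index %d" % i)
            out.append(env[text[i + 1:j]])
            i = j
        else:
            # Ordinary character: copy it through unchanged.
            out.append(ch)
            i += 1
    return "".join(out)


sample = r'This $line has \stuff \\in it that costs \$50 and some $variables.'
result = replace_loop(sample, {'line': 'LINE', 'variables': 'apples'})
print(result)
```

Every character is consumed by exactly one branch, and the lexical category of each token is implied by the branch that handled it, which is the property the comment is asking for.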

And, although it's not present in this case or in all regexp engines, it's a lot easier to accidentally write an exponential-time algorithm in a regexp than in a nested loop. And my experience has been that it's harder to debug it, too.
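A small sketch of how this can happen, assuming a backtracking engine such as CPython's `re` module. The patterns `(a+)+$` and `a+$` are standard illustrations rather than anything from this thread, and the timings will vary by machine.

```python
import re
import time

# Nested quantifiers: a run of "a" can be split between the inner and
# outer + in many ways, and a backtracking engine tries them all before
# it gives up.
slow = re.compile(r"(a+)+$")
# Accepts the same strings without nesting: only one way to match a run.
fast = re.compile(r"a+$")


def timed(pattern, text):
    """Return the match result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


for n in (10, 14, 18):
    text = "a" * n + "b"  # the trailing "b" forces the match to fail
    slow_result, slow_secs = timed(slow, text)
    fast_result, fast_secs = timed(fast, text)
    print(n, slow_result, fast_result, "%.4fs vs %.6fs" % (slow_secs, fast_secs))
```

The input lengths are kept small on purpose. Each extra "a" roughly doubles the work for the nested pattern, so much longer inputs can appear to hang.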


St-Clock 5457 days ago

I agree and disagree:

"If I were to write out an explicit loop over the characters of the string, I would be a lot more sure that I wasn't accidentally dropping characters due to an inadvertent failure to make the regexp exhaustive"

This is why regex comments exist. For any non-trivial regex (more than two or three characters), you should break down and document your regex. Otherwise, it's worse than a 1000-character-long Perl one-liner.
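As an illustration of what a commented regex could look like, here is the lexer pattern from the parent comment rewritten with Python's `re.VERBOSE` flag. The name `TOKEN` and the comment wording are my own. The alternatives themselves are unchanged.

```python
import re

TOKEN = re.compile(r"""
    [^\\$]+     # a run of ordinary text: anything except backslash or dollar
  | \\.         # a backslash escape: backslash plus the escaped character
  | \$\w+       # a variable reference such as $line
  | \\$         # a lone backslash at the very end of the input
""", re.VERBOSE)

text = r'This $line has \stuff \\in it that costs \$50 and some $variables.'
tokens = TOKEN.findall(text)
print(tokens)
```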

"it's a lot easier to accidentally write an exponential-time algorithm in a regexp than in a nested loop"

So true. I did not realize it was possible until I made that mistake. Debugging these cases is extremely difficult. For two strings that look similar, the same regex can go crazy on one. But this has happened to me only once in the past three years (since I started relying heavily on regular expressions for a project).


kragen 5456 days ago

> This is why regex comments exist.

Regex comments don't help much with inadvertently writing a non-exhaustive regex (i.e. one for which some possible input could fail to match), or a few other kinds of regexp bugs. Or, how would you write the regexp in the above code with comments so that it would be obvious if you left out the \\$ case?
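One way to catch this kind of omission mechanically, rather than through comments, is a round-trip check: if the tokens returned by `findall` do not join back into the input, something was skipped. The helper name `dropped` and the test strings below are hypothetical. This is a sketch, not a complete test suite.

```python
import re

TOKEN_RE = r'[^\\$]+|\\.|\$\w+|\\$'        # the pattern from the thread
WITHOUT_TRAILING = r'[^\\$]+|\\.|\$\w+'    # the same pattern minus the \\$ case


def dropped(pattern, text):
    """Return True if findall skipped any characters of text."""
    # Neither pattern has capture groups, so findall returns whole matches.
    return "".join(re.findall(pattern, text)) != text


# A trailing backslash is silently skipped when the \\$ case is missing.
print(dropped(WITHOUT_TRAILING, 'ends with \\'))  # True
print(dropped(TOKEN_RE, 'ends with \\'))          # False

# A "$" that is not followed by a word character matches no alternative,
# so even the full pattern skips it.
print(dropped(TOKEN_RE, 'costs $ 5'))             # True
```

Whether skipping a bare `$` is intended is not stated in the thread. The check only shows that the input is not fully covered by the pattern.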
