by silisili 1729 days ago
> They stopped compiling regexps on the fly and moved the regexps to package variables. (I actually don't know if this was a significant win; there might just be the three big wins.)

Anecdotally, this could be a huge win, depending on how often it's called.

A guy I was working with, new to Go, was writing a router config parser and asked why it was so slow.

The first thing I did was move regexp.Compile out of the hot path and into a broader scope. Run time went from something like 40 seconds down to 2 on my machine.
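For anyone who hasn't hit this, here is a minimal sketch of the change described above. The pattern and the function names (ifaceRe, countSlow, countFast) are made up for illustration and are not from the original parser; only regexp.Compile, regexp.MustCompile and MatchString are real standard library calls.

```go
package main

import (
	"fmt"
	"regexp"
)

// ifaceRe is compiled once, at package initialization.
// MustCompile panics on an invalid pattern, which is usually
// acceptable for a static pattern written in the source.
// The pattern itself is a hypothetical example.
var ifaceRe = regexp.MustCompile(`^interface (\S+)$`)

// countSlow recompiles the same pattern for every line.
// This is the hot-path mistake.
func countSlow(lines []string) int {
	n := 0
	for _, line := range lines {
		re, err := regexp.Compile(`^interface (\S+)$`)
		if err != nil {
			continue
		}
		if re.MatchString(line) {
			n++
		}
	}
	return n
}

// countFast reuses the package-level compiled pattern.
func countFast(lines []string) int {
	n := 0
	for _, line := range lines {
		if ifaceRe.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	lines := []string{"interface eth0", "hostname r1", "interface eth1"}
	fmt.Println(countSlow(lines), countFast(lines)) // prints "2 2"
}
```

Both functions return the same result; the only difference is how many times the pattern gets compiled. How much that matters depends on how often the function is called and how complex the pattern is.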

3 comments

I think it's easy to assume that Go's regex library would keep an internal cache of compiled expressions, using the expression string as a map key. On the other hand, I can see why they haven't implemented one: it would take memory usage out of the author's direct control.

It would probably be a good idea to add performance hints like 'prefer to put static regular expressions in a package variable' in a linter or go vet.

Actually, I would expect any package not to silently cache things unless explicitly told to. Otherwise this creates an unbounded memory leak.

Moving static (at least as far as the loop is concerned) expressions out of a loop is one of the most fundamental optimizations a programmer should do when writing code.

> I think it's easy to assume that in this case Go's regex library would keep an internal cache of expressions

IMHO, the stdlib doing implicit memoization is a catastrophe waiting to happen.

I think that handling regexps and caching functions are two composable and orthogonal features that should be handled by two separate packages or libraries.
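A rough sketch of what that composition could look like: an explicit, bounded cache layered on top of regexp, so the caller opts in and controls the memory. RegexpCache, NewRegexpCache and Get are hypothetical names, not part of the standard library, and the clear-everything-when-full eviction is just the simplest thing that keeps memory bounded.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// RegexpCache is a hypothetical, explicitly bounded cache of
// compiled patterns. It is not part of the standard library.
type RegexpCache struct {
	mu  sync.Mutex
	max int // maximum number of cached patterns; should be >= 1
	m   map[string]*regexp.Regexp
}

func NewRegexpCache(max int) *RegexpCache {
	return &RegexpCache{max: max, m: make(map[string]*regexp.Regexp)}
}

// Get returns the compiled form of pattern, compiling it only
// if it is not currently cached.
func (c *RegexpCache) Get(pattern string) (*regexp.Regexp, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if re, ok := c.m[pattern]; ok {
		return re, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	if len(c.m) >= c.max {
		// Crude eviction: drop everything once the cache is full.
		c.m = make(map[string]*regexp.Regexp)
	}
	c.m[pattern] = re
	return re, nil
}

func main() {
	cache := NewRegexpCache(2)
	a, _ := cache.Get(`a+`)
	b, _ := cache.Get(`a+`)
	fmt.Println(a == b) // prints "true": the second call is a cache hit
}
```

The point is that the cache is a visible value with a visible size limit, rather than hidden global state inside the regexp package.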

Spring (Boot) works exactly the same way. We once found that 30% of CPU time was being spent parsing path regexes in controllers somewhere deep inside Spring. We rewrote 1500 endpoints to use hardcoded paths, and that fixed the CPU usage.
I've seen the same in Python, probably a dozen times. Sometimes folks think it's ugly (un-Pythonic), but there are plenty of cases in the standard library to point to.
That's because the Python regex module caches the regexes it compiles, so it only happens once. It's proper and good usage to specify the regex string inline, even in a hot path.

I'd only use a variable when I'm using the same regex multiple times in code, and even then I could still just have the variable be the string.

> That's because the Python regex module caches the regexes it compiles, so it only happens once. It's proper and good usage to specify the regex string inline, even in a hot path.

Last time I had a look, the regex cache was pretty small (a few hundred entries) and got completely cleared when full. It might have improved since, but historically it was very simplistic.

I disagree that it’s “proper and good usage” to specify regex inline. It’s fine for many usages but that’s as far as I’d go.

Even then, the hashing and lookup are completely unnecessary in a hot path. Having a variable with a compiled regex is not un-Pythonic, AFAIK.
Yeah, it's been a while since I've benchmarked that. I'll try it out.
Gotta agree with the sibling comments here. The performance difference is definitely smaller than it used to be, but there are still good reasons to keep compiled regexes in module scope.

Caveat: I write libraries, not "production" code; my requirements are significantly more strict. One thing I can't do is make assumptions about where my code will run. If you're using my library, and you compile a whole bunch of regexes, they'll evict my regexes from the cache. I don't want the performance of my library to suffer, so I'll keep them in the module scope.