by IshKebab 995 days ago
I've looked into this before and there are very few tools for this. The only vaguely generic one I've found is Codespeed: https://github.com/tobami/codespeed

However it's not very good. Seems like most people just write their own custom performance monitoring tooling.

As for how you actually run it, you can get fairly low-noise run times by running on a dedicated Linux machine. You have to do some tricks like pinning your program to dedicated CPU cores and making sure nothing else can run on them. You can get under 1% variance that way, but in general I found that you can't get wall time variance low enough to be useful in most cases, so instruction count is a better metric.
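
As a rough illustration of the pinning step, here is a minimal Python sketch. It assumes Linux (`os.sched_setaffinity` is not available elsewhere), treats "variance" as standard deviation divided by mean, and uses a made-up `workload` function as a stand-in for the program under test. Pinning alone does not keep other processes off the core; that needs separate system configuration, which this sketch does not attempt.

```python
import os
import statistics
import time


def pin_to_core(core):
    """Try to pin the current process to one CPU core. Returns True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # not available on this platform (e.g. macOS, Windows)
    try:
        if core not in os.sched_getaffinity(0):
            return False  # core is not in the set this process may use
        os.sched_setaffinity(0, {core})
        return True
    except OSError:
        return False


def time_runs(fn, runs=30):
    """Return the wall-clock seconds taken by each of `runs` calls to fn()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def relative_spread(samples):
    """Standard deviation divided by the mean (coefficient of variation)."""
    return statistics.stdev(samples) / statistics.mean(samples)


def workload():
    # Hypothetical stand-in for the program being benchmarked.
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    pinned = pin_to_core(0)
    samples = time_runs(workload)
    print(f"pinned={pinned} spread={relative_spread(samples):.2%}")
```

Comparing the printed spread with and without pinning, on an otherwise idle machine, gives a feel for how much noise is left.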

I think you could do better than instruction count, though it would be a research project: take all the low-noise performance metrics you can measure (instruction count, branch misses, etc.) and measure a load of wall times for different programs on different systems (core count, RAM size, etc.). Feed that into some kind of ML system and it should give you a decent model for a low-noise wall time estimate.

Good tips here:

https://llvm.org/docs/Benchmarking.html

https://easyperf.net/blog/2019/08/02/Perf-measurement-enviro...


Surely it’s possible to build some benchmark to demonstrate the difference right? Otherwise, what’s the point of making that improvement in the first place?

I think what you’re saying, though, is that benchmarks and microbenchmarks that are cheap to run are valuable, and in those, instruction counts may be the only way to measure a 5% improvement (you’d have to run the test for a whole lot longer to prove that a 5% instruction count improvement is a real 1% wall clock improvement and not just noise). Even criterion gets really iffy about small improvements, and it tries to build a statistical model.

> Surely it’s possible to build some benchmark to demonstrate the difference right? Otherwise, what’s the point of making that improvement in the first place?

No, sometimes the improvement you made is like 0.5% faster. It's very very difficult to show that that is actually faster by real wall clock measurements so you have to use a more stable proxy.

What's the point of a 0.5% improvement? Well, not much. But you don't do one, you do 20, and cumulatively your code is 10% faster.
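
To check the arithmetic, here is a small sketch. It assumes each of the 20 changes removes 0.5% of whatever run time is left, so the gains compound rather than add; the function name is made up for illustration.

```python
def cumulative_time_reduction(per_step, steps):
    """Fraction of the original run time removed after `steps` compounding cuts."""
    return 1 - (1 - per_step) ** steps


if __name__ == "__main__":
    # Twenty 0.5% cuts remove roughly 9.5% of the run time, close to the 10% above.
    print(f"{cumulative_time_reduction(0.005, 20):.1%}")
```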

I really recommend Nicholas Nethercote's blog posts. A good lesson in micro-optimisation (and some macro-optimisation).

> It's very very difficult to show that that is actually faster by real wall clock measurements so you have to use a more stable proxy.

That’s what I’m saying, though. You don’t actually need a stable proxy. You should be able to quantify the wall clock improvement, but it requires a very long measurement time. For example, a 0.5% improvement amounts to a benchmark that takes 1 day completing about 7 minutes earlier. The reason you use a stable proxy is that the benchmark finishes more quickly, which shortens the feedback loop. But relying too much on the proxy can also be harmful, because you can decrease the instruction count and slow down wall clock time (or vice versa). Wall clock performance is more complex: branch prediction, data dependencies, and cache performance also really matter.
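
To put a rough number on "very long measurement time", here is a sketch using the textbook sample-size estimate for comparing two means. It assumes run times are roughly normally distributed with the same noise before and after the change, a two-sided 5% significance level, and 80% power; those defaults and the function names are my assumptions, not anything from the thread.

```python
import math
from statistics import NormalDist


def runs_needed(noise, improvement, alpha=0.05, power=0.8):
    """Runs per version needed to detect `improvement` with a two-sample z-test.

    `noise` is the run-to-run standard deviation and `improvement` the expected
    change, both as fractions of the mean run time (0.01 means 1%).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * noise / improvement) ** 2)


def time_saved_minutes(total_minutes, improvement):
    """Minutes saved on a benchmark of `total_minutes` for a fractional improvement."""
    return total_minutes * improvement


if __name__ == "__main__":
    print(runs_needed(noise=0.01, improvement=0.005))   # 63 runs per version
    print(time_saved_minutes(24 * 60, 0.005))           # about 7.2 minutes
```

So even with noise held to 1%, showing a 0.5% win takes dozens of runs of each version, and the count grows with the square of the noise-to-improvement ratio.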

So if you want to be really diligent with your benchmarks (and you should be when micro-optimizing to this degree), you should validate your assumptions by confirming the impact on wall clock time, as that’s “the thing” you’re actually optimizing, not cycle counts for cycle counts’ sake (the same goes for power or memory usage if that’s what you’re optimizing). Never forget that a proxy measurement can stop being a good measurement once it becomes the target rather than the thing you actually want to measure.