by anp 2816 days ago
Each benchmark result is only compared against values from runs on literally the same machine, actually. I agree that good results here would be extremely difficult to produce on virtualized infra, so I rented a few cheap dedicated servers from Hetzner. I'm glad that I decided to pin results to a single machine, because even across these identically binned Hetzner machines I saw 2-4% variance when I ran some Phoronix benches to compare.

I go into a little bit of detail on this in the talk I link to towards the bottom of the post. Here's a direct link for convenience: https://www.youtube.com/watch?v=gSFTbJKScU0
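For readers who want the "same machine only" rule spelled out, here is a minimal sketch. It is not the author's tooling: the sample tuple layout, the toolchain labels, and the function name `relative_changes` are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample layout: (machine, benchmark, toolchain, seconds).
def relative_changes(samples, baseline, candidate):
    """Relative change of `candidate` vs `baseline`, computed per machine.

    A candidate result is only ever compared against baseline results
    recorded on the same machine for the same benchmark.
    """
    grouped = defaultdict(list)
    for machine, bench, toolchain, seconds in samples:
        grouped[(machine, bench, toolchain)].append(seconds)

    changes = {}
    for (machine, bench, toolchain), times in grouped.items():
        if toolchain != candidate:
            continue
        base_times = grouped.get((machine, bench, baseline))
        if not base_times:
            # No baseline from this machine: skip instead of borrowing
            # a baseline from a different (even "identical") machine.
            continue
        base = median(base_times)
        changes[(machine, bench)] = (median(times) - base) / base
    return changes
```

With a 2-4% spread between nominally identical machines, a cross-machine comparison could hide or invent a regression of that size, which is why the sketch drops results that have no same-machine baseline.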


A suggestion: consider using callgrind to measure performance (instructions retired, cache misses, branch mispredictions, whatever) instead of wall clock time. It will be much slower per run, but since it will also be precise you shouldn't need to do multiple runs, and you should be able to run a bunch of different benchmarks concurrently without them interfering with each other or having anything else interfere with them.
I currently do something pretty similar by using the perf subsystem in the Linux kernel to track the behavior of each benchmark function. In my early measurements I found concurrent benchmarking to introduce unacceptable noise even with this measurement tool and with cgroups/cpusets used to pin the different processes to their own cores. Instead of trying to tune the system to account for this, I chose to build tooling for managing a single runner per small cheap machine.
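A rough sketch of what counter-based measurement on a pinned core can look like. This is an assumption-laden stand-in, not the setup described above: it shells out to the `perf stat` CLI and `taskset` rather than using the kernel interface directly or cpusets, and the event list and function names are hypothetical choices.

```python
import subprocess

# Assumed event list; which events are available depends on the CPU and kernel.
EVENTS = ["instructions", "cache-misses", "branch-misses"]

def perf_command(bench_cmd, cpu, events=EVENTS):
    """Build a command that pins `bench_cmd` to one CPU and counts events."""
    return [
        "taskset", "-c", str(cpu),        # pin to a single core
        "perf", "stat",
        "-x", ",",                        # machine-readable CSV output
        "-e", ",".join(events),
        "--", *bench_cmd,
    ]

def parse_perf_csv(text):
    """Parse `perf stat -x,` output: value, unit, event name, ..."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or line.startswith("#"):
            continue
        value, _unit, event = fields[:3]
        try:
            counts[event] = int(value)
        except ValueError:
            continue  # e.g. "<not counted>" or "<not supported>"
    return counts

def measure(bench_cmd, cpu):
    """Run one benchmark alone on `cpu` and return its counter values."""
    proc = subprocess.run(
        perf_command(bench_cmd, cpu),
        capture_output=True, text=True, check=True,
    )
    return parse_perf_csv(proc.stderr)  # perf stat reports on stderr by default
```

Pinning only fixes which core runs the process. Neighbours on other cores can still compete for shared caches and memory bandwidth, which fits the noise reported above.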
No such 'noise' is possible with callgrind, as it's basically simulating the hardware. If you're using a VM it seems like you could still get variation between different runs due to other activity on the host system.
The problem with callgrind is (http://valgrind.org/docs/manual/cg-manual.html#branch-sim):

> Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004.

In other words, the data produced by Callgrind may be suitable for finding obvious regressions, but there may still be other regressions that only show up on more modern CPUs.

Please don't, because the memory access pattern will be very different.
Some of those target benchmarks are on Rayon, and we've found that valgrind interferes with threading way too much to be useful there.
This is one of the many metrics of the official Rust compiler performance benchmarks [1].

[1]: https://perf.rust-lang.org/nll-dashboard.html

I haven't used callgrind, but wouldn't running benchmarks concurrently still lead to cache interference?
No, because callgrind is simulating the hardware, including the caches. Which is why it's also much slower.
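To make the "simulating the caches" point concrete, here is a toy direct-mapped cache model. It is far simpler than what Cachegrind/Callgrind actually simulate, and the sizes and the name `simulate_misses` are assumptions for illustration. The point is only that the miss count is a pure function of the program's own address trace.

```python
def simulate_misses(addresses, num_lines=64, line_size=64):
    """Count misses for an address trace in a toy direct-mapped cache.

    Nothing outside `addresses` affects the result, so the same trace
    always yields the same count, whatever else the host is running.
    """
    lines = [None] * num_lines          # block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_size       # which memory block is touched
        index = block % num_lines       # the one line it may live in
        if lines[index] != block:
            lines[index] = block        # evict whatever was there
            misses += 1
    return misses
```

A real cache shared with a neighbouring process would see extra evictions caused by that neighbour. The simulated one only ever sees the trace it is given, which is also why simulation costs so much more time per run.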
Thanks for the link. I'll give it a watch :D