Hacker News new | ask | show | jobs
by chillydawg 3225 days ago
You need to first create a clean slate each time for running the experiment: no cache, no FILESYSTEM cache etc. Maybe a tonne of single use docker images? Even then filesystem caches will mess you up a little.

Beyond that, you need to run the same build "several" times to see what the variance is. Without getting specific, if the builds are within a couple percent of each other, do "a few" and take the mean. If they're all over the place do "lots" and only stop once the mean stabilises. There are specific methods to define "lots" and "a few" but it's usually obvious for large effects and you don't need to worry too much about it.

If you're trying to prove that you've made a 0.1 improvement on an underlying process that is normally distributed with a stddev of, like 2, then you're going to have to run it a lot and do some maths to show when to stop and accept the result.

2 comments

I want measurements with filesystem cache because I'm interested in estimating the speed of the compile-test-edit cycle. If you want to estimate the impact on emerge then you'll want no filesystem cache.

It's all about measuring based on what you intend to use the measurements for.

If the measurements are all over the place, why not take the fastest? The average is no good, because it'll be influenced by the times it wasn't running as fast as possible.

I don't myself lose much sleep over worrying about the times it runs faster than possible.

I agree with this sentiment. Any time worse than the fastest is due to noise in the system (schedulers etc). So the fastest is the lowest noise run.

Of course, as I said in another comment it depends what you want to do with the measurement. If you plan to edit how long a run will take on an existing system, then you need to accept the noise and use the mean (or median).