by noselasd 3225 days ago
So honest question from a non-statistician,

how, concretely, should I go about doing this particular analysis of compile time for one project? How many times should I run the build for each of the 2 compilers, and what should I do with the results so I can: 1. Draw a conclusion 2. Come up with fair numbers for how they compare?

I would hope someone could teach this simple and very concrete thing to the HN crowd, and I do hope the answer is not "go learn statistics".

2 comments

You need to first create a clean slate each time for running the experiment: no cache, no FILESYSTEM cache etc. Maybe a tonne of single use docker images? Even then filesystem caches will mess you up a little.

Beyond that, you need to run the same build "several" times to see what the variance is. Without getting specific, if the builds are within a couple percent of each other, do "a few" and take the mean. If they're all over the place do "lots" and only stop once the mean stabilises. There are specific methods to define "lots" and "a few" but it's usually obvious for large effects and you don't need to worry too much about it.
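To make "run it several times and look at the variance" concrete, here is a minimal Python sketch. The helper names `time_command` and `relative_spread` are made up for illustration, and the command shown is only a placeholder for whatever clean-and-build step you actually use.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs):
    """Run cmd `runs` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True aborts if a build fails, so failed runs never count as timings.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def relative_spread(durations):
    """Sample standard deviation expressed as a fraction of the mean."""
    return statistics.stdev(durations) / statistics.mean(durations)


if __name__ == "__main__":
    # Placeholder command (assumption): substitute your real build invocation.
    cmd = [sys.executable, "-c", "pass"]
    times = time_command(cmd, runs=5)
    print(f"mean {statistics.mean(times):.3f}s, "
          f"spread {relative_spread(times):.1%}")
```

If the spread comes out at a couple of percent, "a few" runs and the mean are probably enough; if it is much larger, keep adding runs until the mean stops moving.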

If you're trying to prove that you've made a 0.1 improvement on an underlying process that is normally distributed with a stddev of, like, 2, then you're going to have to run it a lot and do some maths to show when to stop and accept the result.
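One common way to do "some maths" here is the textbook sample-size formula for comparing two means under a normal approximation, n = 2 * ((z_alpha + z_beta) * sigma / delta)^2 runs per variant. The sketch below assumes a two-sided 5% significance level and 80% power; those defaults, and the function name `runs_per_variant`, are my choices and not something stated in the thread.

```python
import math
from statistics import NormalDist


def runs_per_variant(delta, sigma, alpha=0.05, power=0.80):
    """Approximate runs needed per variant to detect a mean difference `delta`
    when individual timings have standard deviation `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# The example from the comment: a 0.1 improvement against a stddev of 2.
print(runs_per_variant(delta=0.1, sigma=2.0))
```

With those assumptions the answer lands in the thousands of runs per variant, which is why small effects buried in noise are so expensive to demonstrate, while an effect as large as the noise needs only a handful.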

I want measurements with filesystem cache because I'm interested in estimating the speed of the compile-test-edit cycle. If you want to estimate the impact on emerge then you'll want no filesystem cache.

It's all about measuring based on what you intend to use the measurements for.

If the measurements are all over the place, why not take the fastest? The average is no good, because it'll be influenced by the times it wasn't running as fast as possible.

I don't myself lose much sleep over worrying about the times it runs faster than possible.

I agree with this sentiment. Any time worse than the fastest is due to noise in the system (schedulers etc). So the fastest is the lowest noise run.

Of course, as I said in another comment, it depends on what you want to do with the measurement. If you plan to estimate how long a run will take on an existing system, then you need to accept the noise and use the mean (or median).
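To see how the choice of statistic changes the story, here is a small sketch that reports the minimum, median and mean side by side. The timings are invented for illustration (one run of the first compiler is assumed to have been disturbed by background load), and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(durations):
    """Return the three summary statistics discussed above."""
    return {
        "min": min(durations),                   # closest to the noise-free run
        "median": statistics.median(durations),  # typical run, robust to outliers
        "mean": statistics.mean(durations),      # expected cost, noise included
    }


# Hypothetical timings in seconds, not real measurements.
compiler_a = [41.2, 41.5, 41.3, 48.9, 41.4]
compiler_b = [39.8, 40.1, 39.9, 40.0, 40.2]

for name, runs in (("A", compiler_a), ("B", compiler_b)):
    stats = summarize(runs)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The single slow run pulls the mean for A well above its median, while the minimum and median barely notice it. Which number is "fair" depends on whether you are asking how fast the compiler can go or how long you will actually wait.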

There are people who have thought about this, e.g., http://onlinelibrary.wiley.com/doi/10.1002/cpe.2939/full

Personally I think it's a better idea to instrument your programs and count the number of memory (block) accesses or something. That metric might actually be useful to a reader a few years in the future. The fact that your program was running faster on a modern x86 processor from the year 2010 tells me nothing about how it would perform today, unless the difference was so large that you never needed statistical testing in the first place...
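As a toy illustration of counting accesses instead of timing, the sketch below wraps a list so that every element read is counted, then compares two search routines. `CountingList` and both search functions are hypothetical examples written for this comment, and reads of a Python list are only a stand-in for real memory or block accesses.

```python
class CountingList:
    """A read-only list wrapper that counts element accesses."""

    def __init__(self, items):
        self._items = list(items)
        self.reads = 0

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        self.reads += 1
        return self._items[index]


def linear_search(seq, target):
    """Scan from the front; return the index of target or -1."""
    for i in range(len(seq)):
        if seq[i] == target:
            return i
    return -1


def binary_search(seq, target):
    """Search a sorted sequence; return the index of target or -1."""
    lo, hi = 0, len(seq) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        value = seq[mid]  # exactly one counted read per iteration
        if value == target:
            return mid
        if value < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


for search in (linear_search, binary_search):
    data = CountingList(range(1000))
    search(data, 999)
    print(search.__name__, data.reads)
```

The counts are identical on every run and every machine, so there is no noise to average away, at the cost of ignoring real-world effects such as caches and branch prediction.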

edit: I'm not sure if this paper is accessible to everyone, so here is an alternate link https://hal.inria.fr/inria-00443839v1/document