| HN Mirror


by jacques_chester 1981 days ago

They're definitely thin on explaining the sample sizes. They say 54 configurations over "nearly 1,000" runs, which suggests 17 tests (918 runs) or 18 tests (972 runs) per configuration.

They run 4 different benchmarks (CPU, network, I/O, TPC-C), suggesting an average of around 4.25 or 4.5 runs per benchmark per configuration. If instead they ran 16 per configuration, that would be a nice round 4 per benchmark per configuration, but total runs would drop to 864, somewhat less than "nearly 1,000".
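For anyone who wants to check the arithmetic, here is a quick sketch. It assumes the runs are split evenly across configurations and benchmarks, which the write-up doesn't confirm; the names (`CONFIGS`, `BENCHMARKS`, `summarize`) are mine.

```python
CONFIGS = 54      # configurations, as stated in the write-up
BENCHMARKS = 4    # CPU, network, I/O, TPC-C


def summarize(runs_per_config):
    """Return (total runs, average runs per benchmark per configuration).

    Assumes an even split of runs, which is a guess on my part.
    """
    total = CONFIGS * runs_per_config
    per_benchmark = runs_per_config / BENCHMARKS
    return total, per_benchmark


for runs in (16, 17, 18):
    total, per_benchmark = summarize(runs)
    print(f"{runs} runs/config: {total} total, "
          f"{per_benchmark:.2f} per benchmark per config")
```

Only 17 and 18 runs per configuration land in the range I'd call "nearly 1,000".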

Assuming my figures are sound, we're looking at 4 to 5 samples per combination. Without some information about the within-group variation, though, it's difficult to distinguish what variation was due to "weather" and what was due to the platform.
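To give a rough sense of why 4 to 5 samples is thin, here's a sketch of how wide a 95% confidence interval on the mean would be, measured in units of the sample standard deviation. It assumes independent, roughly normal samples, which is my assumption and may well not hold for cloud benchmarks. The critical values are the standard two-sided 95% Student's t values, and the function name is made up.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {3: 3.182, 4: 2.776}


def ci_half_width_in_sd(n):
    """Half-width of a 95% CI for the mean, as a multiple of the sample SD.

    Assumes n independent, roughly normal samples (an assumption, not
    something the write-up states).
    """
    return T_CRIT_95[n - 1] / math.sqrt(n)


for n in (4, 5):
    print(f"n={n}: mean +/- {ci_half_width_in_sd(n):.2f} x sample SD")
```

So with 4 samples the interval is roughly the mean plus or minus 1.6 sample standard deviations, and with 5 it's about 1.2. Without knowing the within-group spread, there's no way to say whether a small gap between two platforms clears that bar.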

I do, however, think that the effect size of some results is large enough to make them useful (e.g., network throughput). But none of the close results (e.g., the single-core difference between AWS and Azure) are very reliable, in my view.