by jacques_chester
1981 days ago
They're definitely thin on explaining the sample sizes. They say 54 configurations over "nearly 1,000" runs, which suggests 17 tests (918 runs) or 18 tests (972 runs) per configuration. They run 4 different benchmarks (CPU, network, I/O, TPC-C), suggesting an average of around 4.25 or 4.5 runs per benchmark per configuration. If instead they ran 16 per configuration, that would be a nice round 4 per benchmark per configuration, but total runs would drop to 864, somewhat less than "nearly 1,000". Assuming my figures are sound, we're looking at 4 to 5 samples per combination. Without some information about the within-group variation, though, it's difficult to distinguish what variation was due to "weather" and what was due to the platform. I do however think that the effect size of some results is large enough to make them useful (e.g., network throughput). But all of the close results (e.g., the single-core difference between AWS and Azure) are not very reliable, in my view.
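
For anyone who wants to check the arithmetic, here is a quick sketch. The 54 configurations and 4 benchmarks are the figures quoted above; the even split of runs across configurations and benchmarks is my assumption, not something the write-up states, and the name `runs_table` is just illustrative.

```python
# Figures quoted from the write-up.
CONFIGS = 54      # number of configurations
BENCHMARKS = 4    # CPU, network, I/O, TPC-C


def runs_table(per_config_options=(16, 17, 18)):
    """For each candidate runs-per-configuration count, return
    (runs per config, total runs, runs per benchmark per config).

    Assumes runs are spread evenly over configurations and benchmarks,
    which is a guess rather than a documented fact.
    """
    rows = []
    for n in per_config_options:
        rows.append((n, CONFIGS * n, n / BENCHMARKS))
    return rows


for n, total, per_bench in runs_table():
    print(f"{n} runs/config: {total} total, {per_bench:.2f} per benchmark per config")
```

Only 17 or 18 runs per configuration land plausibly close to "nearly 1,000"; either way that leaves 4 to 5 samples per benchmark per configuration.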