I'm thinking of it as a statistical hypothesis test. The null hypothesis is that they come from the same distribution. Under that hypothesis, there's only a 0.05 chance of seeing three X tests all below three Y tests. So if we see this, we can probably reject the null.
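To see where the 0.05 comes from, here is a rough sketch that counts the orderings directly. It assumes three runs of each and no ties; the variable names are just mine for illustration. Under the null, every way of assigning the six ranks to X and Y is equally likely, and only one of them puts all three X runs below all three Y runs.

```python
from itertools import combinations

n_x, n_y = 3, 3          # three runs of X, three runs of Y (assumed, as in the comment)
total = n_x + n_y

# Ranks 0..5, where 0 is the fastest run. Under the null hypothesis every
# choice of which three ranks belong to X is equally likely (assuming no ties).
arrangements = list(combinations(range(total), n_x))

# The "XXXYYY" outcome: X holds exactly the three lowest ranks.
extreme = [a for a in arrangements if a == tuple(range(n_x))]

p_value = len(extreme) / len(arrangements)
print(len(arrangements), p_value)  # 20 0.05
```

Note this is a one-sided count: it only covers X beating Y in all three runs, not the mirror-image YYYXXX outcome.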

If we think X and Y distributions are both something like normal with similar variance, then we should also be able to say the chance of XXXYYY given Y is better than X is at most 0.05.

But if the distributions for X and Y can be really different, then I think you're right -- this test could be misleading! For example, say Y always takes 2 seconds, and X takes 1 second 99% of the time, but 1% of the time it takes an hour. If we run three tests of each, we'll probably only see good runs from X and conclude it's better, when on average it's not.
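Putting rough numbers on that example, here is a small sketch. It assumes the slow case happens 1% of the time and the fast case covers the remaining 99%, and it takes "an hour" as 3600 seconds; the names are made up for illustration.

```python
# Assumed numbers from the example: Y is constant, X is usually fast but rarely very slow.
p_slow = 0.01            # chance a single X run takes an hour
p_fast = 1 - p_slow      # assumed: the remaining runs take 1 second
x_fast, x_slow = 1.0, 3600.0
y_time = 2.0
n_runs = 3

# Expected time of a single X run.
mean_x = p_fast * x_fast + p_slow * x_slow

# Chance that all three X runs are fast, so X beats Y in every comparison.
p_all_fast = p_fast ** n_runs

print(f"mean X = {mean_x:.2f}s, mean Y = {y_time:.2f}s")
print(f"P(all {n_runs} X runs fast) = {p_all_fast:.3f}")
```

So under these assumptions the test would show XXXYYY about 97% of the time, even though X's average time is roughly 37 seconds against Y's 2.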