| "I did an A/A test, basically testing the same exact page––expecting the results would be the same." That's not how A/B testing works. :) Let's say we want to detect a 1% lift in some metric at 95% confidence and we set up an A/A test. We do the math and it tells us we need to sample 1,000 people to reach 95% confidence on a 1% lift. If we ran the A/A test 100 times, roughly 5 of them would show a statistically significant difference between the two groups. That's what "95% confidence" means -- it means your false positive rate is 5%. This is called a Type I Error. http://en.wikipedia.org/wiki/Type_I_and_type_II_errors You could run a kind of meta-analysis and use the false positive rate as the variable you're measuring to see if there's a statistically significant difference between the %5 false positive rate you expect and the false positive rate the A/B testing software generates in practice. In this case, your null hypothesis is that the "true alpha" of the A/B testing software is 0.05. You'd sample from among all the 95% confidence tests you run and see whether you can reject the null hypothesis. |