by tisme
4974 days ago
Isn't cutting off at some confidence level on the 'do not do this' list for A/B testing? As I understood it, you test until you reach a predefined number of conversions, and then you check significance to see whether the result you obtained is valid. You don't test until you 'hit confidence'. See for instance: http://www.evanmiller.org/how-not-to-run-an-ab-test.html Can anybody here shed some light on this?
That's a blog post that I should get around to writing a rebuttal to some day, because it is widely quoted and off base.
In theoretical frequentist math world, it is correct. If you peek repeatedly, you'll eventually hit confidence even when there is no difference. Back in the real world, it is perfectly acceptable to use a strategy like, "We'll set a really high confidence (e.g. 99.5%) for cut-off until we get to a couple of thousand successes, and then we'll drop our standards substantially (e.g. 95% cut-off). If we are forced to stop for business reasons, we'll choose whatever happens to be ahead at the moment."
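For concreteness, here is a minimal Python sketch of a rule along those lines. The names (`confidence`, `decide`), the exact cut-off of 2000 total successes, and the choice of a pooled two-proportion z-test are all my assumptions for illustration; the description above only says "a couple of thousand successes" and does not name a test.

```python
from math import sqrt
from statistics import NormalDist

# Assumed values: the text only says "a couple of thousand successes".
SUCCESS_CUTOFF = 2000
STRICT_CONFIDENCE = 0.995   # used while data is scarce
RELAXED_CONFIDENCE = 0.95   # used once enough successes are in


def confidence(succ_a, n_a, succ_b, n_b):
    """Two-sided confidence that the two conversion rates differ (pooled z-test)."""
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no successes (or no failures) at all: nothing to conclude
    z = abs(succ_b / n_b - succ_a / n_a) / se
    return 2 * NormalDist().cdf(z) - 1


def decide(succ_a, n_a, succ_b, n_b, forced_stop=False):
    """Return "A" or "B" once a winner is called, or None to keep testing."""
    # Whoever is ahead right now; a tie keeps A (assumed to be the control).
    leader = "B" if succ_b / n_b > succ_a / n_a else "A"
    if forced_stop:
        return leader
    if succ_a + succ_b < SUCCESS_CUTOFF:
        required = STRICT_CONFIDENCE
    else:
        required = RELAXED_CONFIDENCE
    if confidence(succ_a, n_a, succ_b, n_b) >= required:
        return leader
    return None


print(decide(500, 5000, 550, 5000))       # early on, judged against 99.5%
print(decide(1000, 10000, 1100, 10000))   # same rates, later, judged against 95%
```

The same 10% versus 11% split gets judged against a different bar depending on how many successes have come in, which is the whole point of the two-stage cut-off.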
And yes, I can use Bayesian statistics to demonstrate that following the strategy that I describe creates acceptably low probabilities of making somewhat wrong business decisions, while allowing you to make good business decisions more quickly. And in practice people can follow it without needing a strong statistical background. (If I did enough work I could come up with a sophisticated optimal curve to use in making decisions. But I have not done that work, and in practice explaining it would be more work than it is worth.)
Why is this? Two reasons.
The first is that you only really get "independent peeks" at different orders of magnitude of data. Thus if you wait until you're past a small amount of data, you don't get a strong "repeated looks" effect.
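One way to make this concrete, as a sketch rather than a proof: if observations are independent and the test statistic is a normalized running sum, then under no real difference the z-scores at two looks with `n_early` and `n_late` observations have correlation sqrt(n_early / n_late). The function name `peek_correlation` and that modelling assumption are mine, not part of the argument above.

```python
from math import sqrt


def peek_correlation(n_early, n_late):
    """Correlation between z-scores at two looks at the same growing data set.

    Assumes i.i.d. observations and a z-score built from the running sum,
    in which case the correlation is sqrt(n_early / n_late).
    """
    return sqrt(n_early / n_late)


# Compare a look at 1,000 observations with progressively later looks.
for n_late in (1_100, 2_000, 10_000, 100_000):
    print(n_late, round(peek_correlation(1_000, n_late), 3))
```

A second look after 10% more data is correlated about 0.95 with the first, so it is barely a new test. The correlation only falls to about 0.32 at 10x the data and to 0.1 at 100x.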
Second, coming to the wrong decision only matters to a business when the chosen option is substantially worse. If you follow a rule like the one I gave, your odds of accidentally deciding the wrong way when there is a business-significant difference are surprisingly low. For instance, if you would detect a 2% difference as significant and there is a real underlying difference of 1%, the odds that you're making the right decision by deciding right now at a 95% confidence level are 99.2%. And if the real difference is a 0.5% win, your odds of making the right decision right now are 91.5%. (This despite the fact that you'd expect to need 16x as much data to even have a good chance of detecting a 0.5% win!)
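How those figures were computed is not stated above, so the following is only a rough sketch of one way to get numbers of that kind. It assumes a normal approximation, reads "would detect a 2% difference as significant" as "an observed 2% difference sits exactly at the two-sided 95% threshold", and asks: given that the test crossed the threshold at all, how often did it cross on the correct side? The function names are hypothetical.

```python
from statistics import NormalDist


def p_right_given_decision(true_diff, detectable_diff, confidence=0.95):
    """P(the declared winner is the true winner | the test reached significance).

    Assumes the observed difference is normally distributed around true_diff,
    with a standard error such that detectable_diff is just significant.
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - (1 - confidence) / 2)
    se = detectable_diff / z_crit
    shift = true_diff / se                    # true difference in z units
    p_right = 1 - std.cdf(z_crit - shift)     # significant, correct direction
    p_wrong = std.cdf(-z_crit - shift)        # significant, wrong direction
    return p_right / (p_right + p_wrong)


def data_multiplier(detectable_now, target_diff):
    """How much more data is needed to shrink the detectable difference.

    The standard error scales as 1 / sqrt(n), so data scales with the square.
    """
    return (detectable_now / target_diff) ** 2


print(p_right_given_decision(0.01, 0.02))
print(p_right_given_decision(0.005, 0.02))
print(data_multiplier(0.02, 0.005))
```

Under these assumptions the two probabilities come out at roughly 99% and 91%, in the same ballpark as the 99.2% and 91.5% quoted above but not identical, so the original figures were presumably derived somewhat differently. The data multiplier for going from a 2% to a 0.5% detectable difference is 16, matching the 16x mentioned above.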
Thus the decisions that you're making are usually correct. And on the occasions when you make the wrong choice, the option you picked is usually not materially worse.