by maibus2
4316 days ago
Attribution issues aside, there are two potential issues that really scare me, and I see them in every "here's our company's cool new A/B testing framework" post:

1. You're running a ton of tests, yet I see no mention of how you adjust for multiple testing. The more tests you run, the higher your chance of getting a false positive. Couple this with the fact that the majority of things you test probably won't be significantly better, and your chance of encountering a false positive is much higher than you might think. You're running hundreds of tests at a significance threshold of 0.05, but the chance of a false positive somewhere in those tests is much, much higher than 5%. Beware multiple tests, and especially beware of the base rate fallacy.

2. Your post has no mention of statistical power or the size of the effect you're looking to detect, which makes me think you might not have considered them. If you don't know the effect size you're looking for or your statistical power, your A/B test results can't be trusted, as you have no idea what your chances are of actually detecting a beneficial result if it exists.
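To put rough numbers on the multiple-testing and base-rate points, here is a minimal Python sketch. It assumes the tests are independent, and the inputs in the usage lines (a 10% base rate of truly better variants, 80% power) are illustrative assumptions, not figures from the post. The function names are hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when none of the tested changes has a real effect."""
    return 1.0 - (1.0 - alpha) ** num_tests


def false_discovery_proportion(base_rate: float, power: float,
                               alpha: float = 0.05) -> float:
    """Expected share of 'significant' results that are false positives.

    base_rate: assumed fraction of tested changes that are truly better.
    power:     assumed chance of detecting a change that is truly better.
    """
    true_positives = base_rate * power
    false_positives = (1.0 - base_rate) * alpha
    return false_positives / (true_positives + false_positives)


def bonferroni_alpha(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    print(f"FWER, 20 tests:  {familywise_error_rate(20):.3f}")
    print(f"FWER, 100 tests: {familywise_error_rate(100):.3f}")
    print(f"Share of wins that are false: {false_discovery_proportion(0.10, 0.80):.2f}")
    print(f"Bonferroni threshold, 100 tests: {bonferroni_alpha(100):.4f}")
```

Under these assumptions, about a third of the "winning" tests would be false positives even though each individual test uses 0.05. Bonferroni is only one (conservative) correction; it is shown here because it is the simplest to state, not because the post uses it.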
We handle these issues in a few ways. In particular, experiment review catches many of them. Think of it like code review for your experiments.
To run an experiment, you need an "experiment helper" to sign off on your change. This involves reviewing your group sizes, verifying that you have the statistical power needed to detect the magnitude of change your hypothesis expects, verifying that you interact with the framework correctly, and so on. Becoming an experiment helper is not easy: it involves shadowing existing reviewers, performing enough reviews across the stack, and passing a test that verifies you understand potential errors (the test itself is composed of many experiments where we have made mistakes).
Changes to an experiment (increasing group sizes, terminating it, modifying it, etc.) all require this review as well.
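As an illustration of the kind of group-size check a reviewer might do, here is a minimal Python sketch of a sample-size calculation for comparing two conversion rates. It is not our actual review tooling: it assumes a two-sided test, equal group sizes, and the usual normal approximation, and the function name and example rates are hypothetical.

```python
import math
from statistics import NormalDist


def required_group_size(baseline_rate: float, expected_rate: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group to detect a change from
    baseline_rate to expected_rate with the given power."""
    if baseline_rate == expected_rate:
        raise ValueError("expected_rate must differ from baseline_rate")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    z_power = standard_normal.inv_cdf(power)
    # Sum of the per-group Bernoulli variances.
    variance = (baseline_rate * (1.0 - baseline_rate)
                + expected_rate * (1.0 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Hypothetical hypothesis: conversion moves from 10% to 12%.
    print(required_group_size(0.10, 0.12))
    # Halving the expected effect roughly quadruples the required size.
    print(required_group_size(0.10, 0.11))
```

With these example inputs the answer is on the order of a few thousand users per group. If the planned groups are much smaller than that, the experiment is unlikely to detect the hypothesized change even if it is real.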