by maibus2 4316 days ago
Attribution issues aside, there are two scarier potential issues that I see here, and in nearly every "here's our company's cool new A/B testing framework" post:

1. You're running a ton of tests, yet I see no mention of how you're adjusting your tests to account for multiple testing. The more tests you run, the higher your chance of getting a false positive. Couple this with the fact that the majority of things you test probably won't be significantly better, and your chance of encountering a false positive is much, much higher than you might think. You're running hundreds of tests with a significance threshold of 0.05, but the chance that at least one of them is a false positive is far higher than 5%. Beware multiple tests, and especially beware the base rate fallacy.

2. Your post has no mention of statistical power or the size of the effect you're looking to detect. That makes me think you might not have considered this. If you don't know the effect size you're looking for or your statistical power, your A/B test results can't be trusted -- as you have no idea what your chances are of actually detecting a beneficial result if it exists.
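
To put rough numbers on point 1, here is a minimal Python sketch. It assumes the tests are independent, and the inputs (100 tests, 10% of tested ideas being real improvements, 80% power) are made-up illustrations, not figures from the post. The function names are hypothetical.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests,
    assuming none of the tested changes has any real effect."""
    return 1 - (1 - alpha) ** num_tests


def false_positive_share(prior_true, alpha=0.05, power=0.8):
    """Fraction of 'significant' results that are false positives, given
    the base rate of changes that truly work (prior_true)."""
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative inputs only: 100 independent tests at alpha = 0.05.
    print(f"P(at least one false positive): {familywise_error_rate(100):.3f}")
    # Illustrative base rate: only 10% of tested ideas are real improvements.
    print(f"Share of wins that are false: {false_positive_share(0.10):.2f}")
```

With these assumed inputs, at least one false positive is nearly certain, and about a third of the "winning" tests are not real wins, even though each individual test uses the 0.05 threshold.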

3 comments

These are both good points. I think it's worth calling out here that this post is really about the infrastructure to perform experiments, not necessarily the means of analysis (although the ui you see in the screenshots performs that analysis).

In terms of these issues, we handle them in a few ways. In particular, experiment review catches many of these issues. Think of it like code review for your experiments.

In order to run an experiment, we require you to have an "experiment helper" sign off on your change. This involves reviewing your group sizes, verifying that you have the statistical power you need to test the magnitude of change your hypothesis expects, verifying that you interact with the framework correctly, and so on. Training to become an experiment helper is generally not very easy, and involves a combination of shadowing existing reviewers, performing enough reviews across the stack, and taking a test to verify you understand potential errors (the test itself being composed of many experiments where we have made mistakes).

Changes to experiments (increasing group sizes, terminating an experiment, modifying an experiment etc.) all require this review.

1. Beware multiple tests: agreed. We could apply corrections (e.g., Bonferroni-like), but don't underestimate the logistical complication of this. Before a month starts, we don't even know how many experiments will be run (although we could try to predict). A different way to address this problem: use other information. Do the results make sense (good)? Is there corroboratory evidence (good)? Are there crazy outliers or things that smell funny (bad)? Of course the experimenter often has a bias (it worked!). Can they convince others? etc.

2. To compute statistical power, you need an estimate of effect size. For many experiments, we don't know; we could run pilots, but in the naive case that doubles the number of experiments to run, and time to wait. By default, we recommend a sample size that will detect a certain effect size (but not smaller ones). That is, we decide we are not interested in small effects, because it makes our life simpler.
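
The default sample size recommendation in point 2 can be sketched with the usual normal-approximation formula for comparing two conversion rates. This is not the framework's actual calculation; the function name, the 10% baseline, and the 2-point minimum effect are assumptions for illustration. Passing a Bonferroni-style alpha (0.05 divided by a guessed number of experiments) shows how the correction from point 1 feeds into group sizes.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, min_effect, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute lift of
    min_effect over a baseline conversion rate (two-sided z-test)."""
    p1 = baseline
    p2 = baseline + min_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


if __name__ == "__main__":
    # Hypothetical experiment: 10% baseline, smallest effect we care about is +2 points.
    print(sample_size_per_group(0.10, 0.02))
    # Same experiment with a Bonferroni-style alpha, guessing 20 experiments this month.
    print(sample_size_per_group(0.10, 0.02, alpha=0.05 / 20))
    # Halving the effect we want to detect roughly quadruples the group size.
    print(sample_size_per_group(0.10, 0.01))
```

Choosing a minimum effect up front is what keeps this tractable: the required group size grows roughly with the inverse square of the effect, so ignoring small effects is a large saving.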

1) It is true that the more tests you run, the higher the chance you see at least one FP; however, the tests are not rejected/accepted as a whole. Each test shows its own results.

2) Effect size was taken into consideration, and there are also R scripts to analyze the results. The impact is trivial.