by glutamate
671 days ago
Sounds like you already know this, but that's not great and will give a lot of false positives. In science this is called p-hacking. The rigorous way to use hypothesis testing is to calculate the sample size needed for the expected effect size and run only one test, once that sample size is reached. But this requires knowing the effect size. If you are running a lot of significance tests, you need to divide the significance threshold by the number of implicit comparisons, so e.g. only accept p<0.001 if running one test per day. Alternatively, just do Thompson sampling until one variant dominates.
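To make the two fixes above concrete, here is a rough sketch in Python. It assumes a conversion-rate test compared with a two-sided two-proportion z-test, and it uses the standard normal-approximation formula for sample size. The function names, the 10% vs 12% rates, and the 50 daily checks are made up for illustration; dividing the threshold by the number of comparisons is the Bonferroni correction.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    p_base and p_variant are the conversion rates you expect to see;
    guessing them up front is the "knowing the effect size" problem.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_base + p_variant) / 2
    spread = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    )
    return ceil(spread ** 2 / (p_base - p_variant) ** 2)


def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance threshold when peeking n_comparisons times."""
    return alpha / n_comparisons


if __name__ == "__main__":
    # Hypothetical numbers: 10% baseline, hoping to detect a lift to 12%.
    print(sample_size_per_variant(0.10, 0.12))
    # Hypothetical schedule: one significance test per day for 50 days.
    print(bonferroni_threshold(0.05, 50))
```

The first number is how many users each variant needs before you run the single test. The second is the stricter cutoff each daily test would have to meet if you insist on checking every day.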
Thompson sampling (a multi-armed bandit approach) optimizes for outcome over the duration of the test by progressively shifting the share of traffic each variant receives. The test runs longer, but yields better outcomes while it runs.
It's objectively a better way to optimize, unless the A/B test itself carries a time-based overhead just by existing (e.g. maintaining two code paths).
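For anyone who hasn't seen it, a minimal Thompson sampling sketch looks something like this. It assumes binary outcomes (converted or not) and a uniform Beta(1, 1) prior on each variant's rate. The function names, the simulated conversion rates, and the round count are all hypothetical, and there is no stopping rule here, only the traffic allocation.

```python
import random


def thompson_pick(successes, failures, rng):
    """Sample each arm's Beta posterior and return the index of the largest draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(true_rates, n_rounds=5000, seed=0):
    """Run a simulated test and return how many users each variant received."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = thompson_pick(successes, failures, rng)
        # Simulated user: converts with the (normally unknown) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


if __name__ == "__main__":
    # Hypothetical variants converting at 5% and 20%.
    print(simulate([0.05, 0.20]))
```

Traffic drifts toward the better variant as evidence accumulates, which is where the "better outcomes while it runs" part comes from. In a real system the simulated coin flip would be replaced by actual user outcomes.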