| Correction. A/B testing does not need to assume that it is testing over a small window of time with stationary conversion rates. In fact, in practice tests often run for a long window of time over which you have good evidence that conversion rates are not stationary. For example, in the middle of running a long test that eventually found a small lift, I've often run and rolled out a second test that generated a much stronger lift. The weaker assumption that you need is that the preference between versions is stable across your fluctuations. Then, because the mix of versions is time independent, that non-stationary fluctuation is not statistically different between the slice that was put into A and the slice that was put into B. Therefore the variation between different samples becomes just another unknown random factor that does not interfere with your statistical analysis of whether there is a difference. This is an advantage of A/B testing over the current state of the art in MAB algorithms. When this came up before, Noel (at Myna) and I privately did an admittedly brief search of the literature for discussion of this point with regard to MAB algorithms. We turned up a number of things that would work in the long run, but none that directly addressed the problem. But in discussion we did manage to come up with effective MAB algorithms, whose regret is only a constant factor worse than standard MAB algorithms, that will accurately identify stable preferences in the face of constantly fluctuating conversion rates. To the best of my knowledge nobody, including Noel, has yet implemented such algorithms in practice. But in principle it can be done. However, even if you do it, several of my other points still apply as real differences. |
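To make the "time-independent mix" point concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the baseline rates, the 10% relative lift for B, and the traffic numbers are assumptions, and the "drifting" allocation is only a crude stand-in for an adaptive scheme whose mix changes over time, not any particular MAB algorithm. It works with expected conversion rates rather than random draws, so it shows only the bias, not the noise.

```python
def pooled_rate(traffic, rates):
    """Expected pooled conversion rate: expected conversions / total visitors."""
    conversions = sum(n * p for n, p in zip(traffic, rates))
    return conversions / sum(traffic)

# Hypothetical baseline conversion rate per period; it swings a lot,
# so the underlying distribution is clearly not stationary.
base_rates = [0.02, 0.10, 0.03, 0.12]

# Assumption: B is 10% better than A (relative) in every period,
# i.e. the preference between versions is stable.
rates_a = base_rates
rates_b = [p * 1.10 for p in base_rates]

# A/B test: the same 50/50 split in every period.
fixed_a = [5000, 5000, 5000, 5000]
fixed_b = [5000, 5000, 5000, 5000]

# Time-varying mix: A happens to receive most of its traffic
# in the high-converting periods.
drift_a = [1000, 9000, 1000, 9000]
drift_b = [9000, 1000, 9000, 1000]

# Difference in pooled rates (B minus A) under each allocation.
fixed_diff = pooled_rate(fixed_b, rates_b) - pooled_rate(fixed_a, rates_a)
drift_diff = pooled_rate(drift_b, rates_b) - pooled_rate(drift_a, rates_a)

print(f"fixed 50/50 split: B - A = {fixed_diff:+.5f}")
print(f"drifting split:    B - A = {drift_diff:+.5f}")
```

With the fixed split the pooled difference comes out positive, matching the true preference for B. With the drifting split the naive pooled comparison favors A even though B is better in every single period, because the fluctuation no longer hits both slices equally.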
I am slightly confused, and this may demonstrate my ignorance, but I was under the impression that A/B testing worked by assigning each user one of two different approaches and then scoring the response. This provides a sample from the population of users of how effective A or B is. You can then run some statistical test on the average scores for each version to determine which one is the winner.
If what I said is true, these statistical tests almost always assume that the distribution that is being drawn from is stationary. So, the only way things work out is if you have an underlying stationary distribution. Otherwise, your statistical test might indicate the wrong thing.
I freely admit that many of your points are still valid, but I don't see how A/B testing is a more powerful algorithm that makes fewer assumptions.