by cmansley
5018 days ago
I think the real distinction here is stationary versus non-stationary distributions. Many of the arguments made in this article hinge on the fact that the responses to the same input change over time (nights are different from daytime, which is different from weekends). By continuously running the A/B test, you are looking at a small window in time, which you assume is stationary so that you can do your t-tests or whatever statistical test you prefer. But, to be clear, this is a heuristic in A/B testing. If you have a window of time over which you know your distributions are stationary, you should always use a logarithmic-regret MAB algorithm because it is theoretically better. I think the best way to frame this argument is that, because the domain does not match the assumptions of MAB, A/B testing has been shown to be robust and easy to modify for non-stationary domains, while logarithmic-regret algorithms are somewhat more fragile.
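For readers who have not seen one, here is a minimal sketch of UCB1, one well-known logarithmic-regret MAB algorithm under the stationarity assumption. It is only an illustration, not something taken from the article: the function name `ucb1`, the `pull` callback, and the conversion rates are all hypothetical, and the rates are deliberately exaggerated so the effect shows up in a short run.

```python
import math
import random


def ucb1(pull, n_arms, rounds):
    """Run UCB1 for `rounds` steps; return how often each arm was played.

    `pull(arm)` is a hypothetical callback returning a reward in [0, 1].
    """
    counts = [0] * n_arms    # number of times each arm was played
    totals = [0.0] * n_arms  # sum of rewards observed per arm
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once before using the index
        else:
            # Index = observed mean + exploration bonus that shrinks
            # as an arm accumulates plays.
            arm = max(
                range(n_arms),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        totals[arm] += pull(arm)
        counts[arm] += 1
    return counts


rng = random.Random(0)
true_rates = [0.2, 0.5]  # assumed fixed (stationary) rates, made up for the demo


def simulated_visitor(arm):
    return 1.0 if rng.random() < true_rates[arm] else 0.0


counts = ucb1(simulated_visitor, n_arms=2, rounds=5000)
print(counts)  # most plays should go to the second arm
```

The point of the sketch is the assumption baked into it: each arm's observed mean is treated as an estimate of one fixed rate. If the rates drift over time, that estimate mixes together periods that are not comparable.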
|
A/B testing does not need to assume that it is testing over a small window in time that has stationary conversion rates. In fact, in practice tests often run for a long window of time over which you have good evidence that the distribution is not stationary. For example, in the middle of running a long test that eventually found a small lift, I've often run and rolled out a second test that generated a much stronger lift.
The weaker assumption that you need is that the preference between versions is stable across your fluctuations. Then, because the mix of versions is time-independent, the non-stationary fluctuation is not statistically different between the slice that was put into A and the slice that was put into B. The variation between different samples therefore becomes just another unknown random factor that does not interfere with your statistical analysis of whether there is a difference.
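A small worked example may make this concrete. The sketch below uses made-up numbers: conversion rates differ sharply between two periods, but B beats A in both. It compares expected pooled rates under a fixed 50/50 split with those under a split that changes between periods. The shifting split is only a stand-in for any scheme whose mix varies over time, not a model of a particular MAB algorithm, and `pooled_rates` is a hypothetical helper.

```python
# Hypothetical conversion rates: they swing between periods,
# but B is better than A in every period (a stable preference).
RATES = {
    "day": {"A": 0.10, "B": 0.12},
    "night": {"A": 0.02, "B": 0.03},
}
VISITORS = {"day": 10_000, "night": 10_000}


def pooled_rates(share_to_b):
    """Expected pooled conversion rate per version.

    `share_to_b` maps each period to the fraction of traffic sent to B.
    Uses expected values, so there is no sampling noise.
    """
    visitors = {"A": 0.0, "B": 0.0}
    conversions = {"A": 0.0, "B": 0.0}
    for period, n in VISITORS.items():
        split = {"A": 1.0 - share_to_b[period], "B": share_to_b[period]}
        for version, share in split.items():
            visitors[version] += n * share
            conversions[version] += n * share * RATES[period][version]
    return {v: conversions[v] / visitors[v] for v in visitors}


# Same mix in every period: the day/night swing hits A and B equally.
fixed = pooled_rates({"day": 0.5, "night": 0.5})

# Mix changes with time: B gets mostly low-converting night traffic.
shifting = pooled_rates({"day": 0.1, "night": 0.9})

print("fixed split:   ", fixed)
print("shifting split:", shifting)
```

With the fixed split, the pooled numbers rank B above A, matching the true preference. With the shifting split, the pooled numbers rank A above B even though B wins in both periods, because the two versions were exposed to different mixes of day and night traffic.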
This is an advantage of A/B testing over the current state of the art in MAB algorithms. When this came up before, Noel (at Myna) and I privately did an admittedly brief search of the literature for discussion of this point with regard to MAB algorithms. We turned up a number of things that would work in the long run, but none that directly addressed the problem.
But in discussion we did manage to come up with effective MAB algorithms, whose regret is only a constant factor worse than that of standard MAB algorithms, that will accurately identify stable preferences in the face of constantly fluctuating conversion rates. To the best of my knowledge, nobody, including Noel, has yet implemented such algorithms in practice. But in principle it can be done.
However, even if you do it, several of my other points still apply as real differences.