|
|
|
|
|
by cmansley
5018 days ago
|
|
Wait. I am slightly confused and this may demonstrate my ignorance, but I was under the impression that A/B testing worked by allocating two different approach to the users and then scoring the response. This provides a sampling from the population of users as to how effective A or B is. You can then run some statistical test on the averages of the scores for each test to determine which one is the winner. If what I said is true, these statistical tests almost always assume that the distribution that is being drawn from is stationary. So, the only way things work out is if you have an underlying stationary distribution. Otherwise, your statistical test might indicate the wrong thing. I freely admit that many of your points are still valid, but I don't see how A/B is a more powerful algorithm that has less assumptions. |
|
The necessary, reasonable, and much weaker assumption needed for A/B testing is that users are independent of each other, and arrive by some Poisson process. Meaning that at any given point in time there is some average rate that users arrive, but each user's arrival is independent of all other arrivals. (More mathematically precisely, the number of people who will arrive in any specified time period follows a Poisson distribution. Poisson processes accurately model everything from nuclear decay counts to emergency room arrivals.) If you randomly divide a group of people who arrived by a Poisson process into two subgroups in a fixed ratio (say, evenly), those subgroups will also turn out to be generated by a Poisson process.
Now we're going to take those subgroups, and feed them into our versions. Let's focus on what happens with version A. Depending on when a user arrived, that user will have some chance of converting to a success, and some chance of not doing so. If we look at a random user that arrived and ignore when they arrived, their probability of converting will be the average of their probability of converting, weighted by the rate of arrival. Furthermore our assumption that users are arriving from a Poisson process means that each user is statistically independent from all of the others. Now it is true that if we pay attention to the fact that certain users arrived close to others, there are likely correlations to be found between those specifically users. But if you sample two random users from all of the users who could have arrived, they will be independent and have an identical probability of conversion, which is that average. (This flows out of the assumption that the initial population came from a Poisson process.)
This same analysis can be done for A and for B. Now we don't know the actual conversion rates over our trial. But suppose that we did know them, and it happened to be that the conversion rate for B is better at every time than for A. Then it it is easy to show that the average conversion rate (weighted by arrival rate of course) over the whole interval for B would be better than for A.
Therefore if we assume that there is a consistent preference, statistical evidence over the sample that B converts better than A is valid statistical evidence that B is actually the better version at any given point in time. This holds even if the difference between their conversion rate is much lower than the fluctuation of the conversion rates of both over the interval we sampled from.
(Very helpfully for A/B test practitioners, this math works whether or not you happen to understand it. As long as you're not adjusting sampling rates, A/B tests are statistically valid.)
Now take a traditional MAB algorithm. The whole analysis that I just did falls apart. The fact that we send traffic to the versions at different rates at different points of time means that the average for random people in the two versions is weighted differently over the interval. This opens up the possibility that the average conversion rate of the whole sample for B can be better than for A, yet A might at every point in time have had a better conversion rate than B.
See http://en.wikipedia.org/wiki/Simpsons_paradox if that seems impossible to you. If you've read that and understand how being better in the sample that a MAB algorithm collected is not statistical evidence that you're actually better, then you may want to re-read this post to understand why an A/B test is still statistically valid.
To avoid this trap, what you need to do in the MAB algorithm is either subsample or scale data (subsampling is provably more robust but not much so, scaling is simpler and converges faster) so that the statistical decision that you're basing your MAB choices on avoids this pitfall, at least in the limit. But as I said before, a detailed discussion of how to make this work and the necessary trade-offs would get fairly involved.