by btilly 5065 days ago

Correction.

A/B testing does not need to assume that it is testing over a small window in time that has stationary conversion rates. In fact in practice tests run for a long window of time over which you have good evidence that the distribution is not stationary. For example in the middle of running a long test that eventually found a small lift, I've often run and rolled out a second test that generated a much stronger lift.

The weaker assumption that you need is that the preference between versions is stable across your fluctuations. Then because the mix of versions is time independent, that non-stationary fluctuation is not statistically different between the slice that was put into A and the slice that was put into B. And therefore the variation between different samples becomes just another unknown random factor that does not interfere with your statistical analysis of whether there is a difference.

This is an advantage of A/B testing over the current state of the art in MAB algorithms. When this came up before, Noel (at Myna) and I privately did an admittedly brief search of the literature for discussion of this point with regard to MAB algorithms. We turned up a number of things that would work in the long run, but none that directly addressed the problem.

But in discussion we did manage to come up with effective MAB algorithms, whose regret is only a constant factor worse than standard MAB algorithms, that will accurately identify stable preferences in the face of constantly fluctuating conversion rates. To the best of my knowledge nobody, including Noel, has yet implemented such algorithms in practice. But in principle it can be done.

However even if you do it, several of my other points still apply as real differences.

cmansley 5065 days ago

Wait.

I am slightly confused and this may demonstrate my ignorance, but I was under the impression that A/B testing worked by allocating two different approaches to the users and then scoring the response. This provides a sampling from the population of users as to how effective A or B is. You can then run some statistical test on the averages of the scores for each version to determine which one is the winner.

If what I said is true, these statistical tests almost always assume that the distribution that is being drawn from is stationary. So, the only way things work out is if you have an underlying stationary distribution. Otherwise, your statistical test might indicate the wrong thing.

I freely admit that many of your points are still valid, but I don't see how A/B is a more powerful algorithm that has fewer assumptions.

btilly 5065 days ago

Perhaps stepping back can clarify.

The necessary, reasonable, and much weaker assumption needed for A/B testing is that users are independent of each other, and arrive by some Poisson process. Meaning that at any given point in time there is some average rate that users arrive, but each user's arrival is independent of all other arrivals. (More mathematically precisely, the number of people who will arrive in any specified time period follows a Poisson distribution. Poisson processes accurately model everything from nuclear decay counts to emergency room arrivals.) If you randomly divide a group of people who arrived by a Poisson process into two subgroups in a fixed ratio (say, evenly), those subgroups will also turn out to be generated by a Poisson process.
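
A quick way to convince yourself of the splitting claim is a small simulation. This is only a sketch: the constant rate of 10 arrivals per unit time, the horizon of 1000, and the function names are assumptions made up for illustration (a real site would have a time-varying rate). It generates arrivals from a Poisson process, flips a fair coin for each one, and checks that each subgroup arrives at about half the original rate.

```python
import random

def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a constant-rate Poisson process on [0, horizon)."""
    times = []
    t = rng.expovariate(rate)          # exponential gap before the first arrival
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)     # independent exponential gaps between arrivals
    return times

def split_evenly(arrivals, rng):
    """Assign each arrival to A or B by an independent fair coin flip."""
    group_a, group_b = [], []
    for t in arrivals:
        (group_a if rng.random() < 0.5 else group_b).append(t)
    return group_a, group_b

rng = random.Random(0)
arrivals = poisson_arrivals(rate=10.0, horizon=1000.0, rng=rng)
group_a, group_b = split_evenly(arrivals, rng)

# Each subgroup should show roughly half the original rate (about 5 per unit time).
rate_a = len(group_a) / 1000.0
rate_b = len(group_b) / 1000.0
print(rate_a, rate_b)
```

The simulation only checks the rates of the two subgroups; that each subgroup is itself a Poisson process is the standard "thinning" property and is not proved here.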

Now we're going to take those subgroups, and feed them into our versions. Let's focus on what happens with version A. Depending on when a user arrived, that user will have some chance of converting to a success, and some chance of not doing so. If we look at a random user that arrived and ignore when they arrived, their probability of converting will be the average over time of the conversion probability, weighted by the rate of arrival. Furthermore our assumption that users are arriving from a Poisson process means that each user is statistically independent from all of the others. Now it is true that if we pay attention to the fact that certain users arrived close to others, there are likely correlations to be found between those specific users. But if you sample two random users from all of the users who could have arrived, they will be independent and have an identical probability of conversion, which is that average. (This flows out of the assumption that the initial population came from a Poisson process.)

This same analysis can be done for A and for B. Now we don't know the actual conversion rates over our trial. But suppose that we did know them, and it happened to be that the conversion rate for B is better at every time than for A. Then it is easy to show that the average conversion rate (weighted by arrival rate of course) over the whole interval for B would be better than for A.

Therefore if we assume that there is a consistent preference, statistical evidence over the sample that B converts better than A is valid statistical evidence that B is actually the better version at any given point in time. This holds even if the difference between their conversion rates is much lower than the fluctuation of the conversion rates of both over the interval we sampled from.

(Very helpfully for A/B test practitioners, this math works whether or not you happen to understand it. As long as you're not adjusting sampling rates, A/B tests are statistically valid.)
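
To make the argument above concrete, here is a rough simulation. Everything numeric in it is an assumption invented for illustration: a conversion rate that swings between 5% and 25% over the day, traffic that is 10x heavier in the afternoon, and a steady 2-point edge for B. The only thing taken from the argument is the fixed 50/50 split that never changes.

```python
import random

def conversion_prob(version, hour):
    """Hypothetical rates: a large daily swing plus a small, constant edge for B."""
    base = 0.05 if hour < 12 else 0.25    # conversion fluctuates widely during the day
    return base + (0.02 if version == "B" else 0.0)

def traffic_weight(hour):
    """Hypothetical traffic pattern: the afternoon is 10x busier than the morning."""
    return 1 if hour < 12 else 10

rng = random.Random(1)
hours = list(range(24))
weights = [traffic_weight(h) for h in hours]
visits = {"A": 0, "B": 0}
conversions = {"A": 0, "B": 0}

for _ in range(200_000):
    hour = rng.choices(hours, weights=weights)[0]   # busy hours contribute more visitors
    version = "A" if rng.random() < 0.5 else "B"    # fixed 50/50 split at all times
    visits[version] += 1
    if rng.random() < conversion_prob(version, hour):
        conversions[version] += 1

rate = {v: conversions[v] / visits[v] for v in visits}
print(rate)
```

Because both versions see the same mix of busy and quiet hours, the pooled rates differ by about the same 2 points that separate them at every hour, even though each rate swings by 20 points over the day.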

Now take a traditional MAB algorithm. The whole analysis that I just did falls apart. The fact that we send traffic to the versions at different rates at different points of time means that the average for random people in the two versions is weighted differently over the interval. This opens up the possibility that the average conversion rate of the whole sample for B can be better than for A, yet A might at every point in time have had a better conversion rate than B.

See http://en.wikipedia.org/wiki/Simpsons_paradox if that seems impossible to you. If you've read that and understand how being better in the sample that a MAB algorithm collected is not statistical evidence that you're actually better, then you may want to re-read this post to understand why an A/B test is still statistically valid.
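
For anyone who wants to see the reversal in numbers, here is a small worked example. The rates and the traffic allocation are made up for illustration; the allocation just mimics a bandit that shifts traffic from A to B as time goes on. A converts better than B in both periods, yet B looks far better in the pooled sample.

```python
from fractions import Fraction

# Hypothetical per-period conversion rates: A beats B in both periods.
rates = {
    "night": {"A": Fraction(10, 100), "B": Fraction(8, 100)},
    "day":   {"A": Fraction(30, 100), "B": Fraction(28, 100)},
}
# Hypothetical bandit-style allocation that shifts traffic from A to B over time.
traffic = {
    "night": {"A": 900, "B": 100},
    "day":   {"A": 100, "B": 900},
}

def pooled_rate(version):
    """Expected conversions divided by visitors, pooled over both periods."""
    converted = sum(rates[p][version] * traffic[p][version] for p in rates)
    visitors = sum(traffic[p][version] for p in traffic)
    return converted / visitors

print(pooled_rate("A"), pooled_rate("B"))  # 3/25 13/50
```

Pooled, A comes out at 12% and B at 26%, purely because most of B's traffic arrived during the high-converting period.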

To avoid this trap, what you need to do in the MAB algorithm is either subsample or scale data (subsampling is provably more robust but not much so, scaling is simpler and converges faster) so that the statistical decision that you're basing your MAB choices on avoids this pitfall, at least in the limit. But as I said before, a detailed discussion of how to make this work and the necessary trade-offs would get fairly involved.
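
As a very rough illustration of what "scaling" could look like, the sketch below rescales each version's counts in a period by the inverse of the share of traffic it received in that period, so that both versions end up weighted by the same traffic profile. This is one possible reading, assumed here for illustration only; it is not the algorithm discussed above, the data is hypothetical, and the sketch ignores the variance and regret trade-offs entirely.

```python
from fractions import Fraction

# Hypothetical observed data: (visitors, conversions) per period and version.
# A converts better in each period, but the allocation drifted toward B.
observed = {
    "night": {"A": (900, 90), "B": (100, 8)},
    "day":   {"A": (100, 30), "B": (900, 252)},
}

def scaled_rate(version):
    """Conversion rate after rescaling each period up to its full traffic."""
    total_visitors = Fraction(0)
    total_conversions = Fraction(0)
    for period in observed.values():
        period_traffic = sum(n for n, _ in period.values())
        visitors, conversions = period[version]
        scale = Fraction(period_traffic, visitors)  # inverse of this version's traffic share
        total_visitors += visitors * scale
        total_conversions += conversions * scale
    return total_conversions / total_visitors

print(scaled_rate("A"), scaled_rate("B"))  # 1/5 9/50
```

On this toy data the unscaled pooled rates would favor B (12% for A against 26% for B), while the scaled rates recover the per-period ordering with A at 20% and B at 18%.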

cmansley 5065 days ago

I don't know how we got onto arrival rates of our visitors. I would just like to state that Simpson's paradox is exactly why we shouldn't compare percent conversions. They are meaningless. However, many of the statistical tests like Student's t-test compensate for this paradox by including the number of samples in the tests. See: http://en.wikipedia.org/wiki/Student%27s_t-test#Unequal_samp...

I think you said one thing that is at the heart of the issue. We assume that E[conversion of B] > E[conversion of A] for the entire period sampled.

I think all of the details about Poisson processes are not required if you just assume that each person is drawn IID from the population.

I just don't think you are answering the right questions here.

Let's assume that each person arrives IID from the infinite population. Then we have a Bernoulli process for each A or B query. A "conversion" results in a 1, a failure results in a 0. Since these people are arriving IID, we can select a sub-sample which is also IID. We would now like to estimate the parameters for each process and/or compare the two processes. We can do this using a t-test. This will give us the statistical significance that one group had a higher "conversion rate" than the other group. Note: arrival rate does not factor into this problem at all because we assume the participants were IID, so the t-test (used correctly, accounting for the different number of samples) will tell us which version's conversion rate is larger.
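
For concreteness, a comparison of two Bernoulli samples of different sizes might look like the sketch below. Note the swap: it uses a two-proportion z-test (pooled normal approximation) rather than a t-test, since the outcomes are 0/1. The counts and the function name are hypothetical, and this is only valid under the stationary IID assumption stated above.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided test of H1: rate_b > rate_a, using the pooled normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # The standard error accounts for the two (possibly unequal) sample sizes.
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

# Hypothetical counts: 200 of 5000 convert on A, 260 of 5200 convert on B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5200)
print(z, p_value)
```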

My question is now what happens when the parameters of your queries for A or B change over time. Still under the assumption that E[B] > E[A], it now matters greatly in which order you use your samples.

I think the only reason you brought the Poisson model into the discussion is to weight the more recent samples higher and down weight the earlier samples in your basket of samples. This is a heuristic for considering a fixed interval in which the samples are stationary. It effectively considers a window that slides with the time of arrival.

btilly 5065 days ago

First let me mention that you should not try to use the Student's t-test. One of the first assumptions of the Student's t-test is that each of the two populations being compared should follow a normal distribution. In A/B tests this assumption is almost never true, and therefore the Student's t-test is an inappropriate statistical test.

OK, now to what I said about Poisson distributions. Assuming that people arrive on a Poisson distribution allows us to conclude 2 key facts:

1. The statistics will behave exactly like they would if each person arrives IID from an infinite population.

2. Simpson's paradox will not apply to the theoretical distribution of the samples for A and B.

Assuming #1 without #2 does not get you very far. But having facts #1 and #2 allows us to use statistics.

I have no idea why you would speculate that I am attempting to weight recent samples higher and downweight earlier samples. All samples are, in fact, weighted exactly the same. This fact notwithstanding, different times of arrival are not weighted the same. That is because the sample rate fluctuates over time depending on factors such as traffic levels on your webserver. But it fluctuates in an identical way for the two versions. (This fact is critical in being able to conclude point #2.)

Does this help?

cmansley 5065 days ago

First, I was using the Student's t-test as a stand-in for whatever test or statistical measure you would like to use. I believe the popular one is Hoeffding's inequality in the bandit literature, hence the log term in the MAB algorithms. I agree this was a poor choice of example.

Second, I believe I am getting hung up on the fact that different arrival times are "weighted" differently. I think you are claiming that the Poisson assumption gives us equal numbers of A and B trials, so we can combine the statistics (counts) and avoid Simpson's paradox. This is fine, but why would you say "different times of arrival are not weighted the same"? Does this mean you are somehow weighting periods of heavy traffic down and weighting low traffic up?

So, what happens when trial A becomes less favorable over time or is less favorable for brief periods? This means that the underlying random variable's mean is changing over time. Most statistical bounds cannot handle this situation.

I am not saying that A/B testing is not something we should do in general. I am saying that it is a good heuristic with very few provable properties compared with logarithmic regret MAB algorithms.

btilly 5065 days ago

> This is fine, but why would you say "different times of arrival are not weighted the same". Does this mean you are somehow weighting periods of heavy traffic down and weighting low traffic up?

You keep on reversing the exact point that I keep on making, and then fail to understand what I said. So I guess that I'll keep repeating the same point in different ways and hope that at some point you'll get it.

Why do I say reversing? Because the weight a time period gets is directly proportional to the expected traffic. Therefore each observation is weighted the same, and periods of heavy traffic are the ones that are weighted most heavily.

Anyways, let's suppose, for the sake of argument, that from 2 AM to 3 AM observations arrive at an average rate of 1 every 10 minutes. Suppose that from 8 AM to 9 AM that they arrive at an average rate of one per minute.

Then, on average, we expect to have 6 observations from the hour in the middle of the night, and 60 observations from the hour from 8 AM to 9 AM.

Thus when we calculate average returns across the entire interval, on average we'll have 10x as many observations from 8 AM to 9 AM. Therefore on average the latter time period will have 10x the impact on the final results.

The conclusion is that different time periods are naturally weighted differently. However the weighting is the same across the two different versions.
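
The arithmetic of the two-hour example is easy to check. The arrival rates below come from the example; the per-hour conversion rates are hypothetical numbers added only to show how the weights feed into an overall average.

```python
# Arrival rates from the example above, in observations per minute.
arrival_rate = {"2-3 AM": 1 / 10, "8-9 AM": 1.0}
# Hypothetical conversion rates for one version in each hour (not from the thread).
conversion = {"2-3 AM": 0.02, "8-9 AM": 0.05}

# Expected number of observations in each 60-minute window.
expected_obs = {hour: rate * 60 for hour, rate in arrival_rate.items()}
total = sum(expected_obs.values())

# Each hour's weight in the overall average is its share of the observations.
weight = {hour: n / total for hour, n in expected_obs.items()}
overall = sum(weight[h] * conversion[h] for h in weight)
print(expected_obs, weight, overall)
```

The weights come out to 1/11 and 10/11, so the busy hour dominates the overall average, and it does so identically for both versions because they share the same arrival rates.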

If you want to get more mathematical about it, suppose that r(t) is the average rate at which observations are arriving in our subgroups. (So r(t) is the same for versions A and B.) Suppose that cA(t) is the rate at which version A converts, and suppose that cB(t) is the rate at which version B converts.

Here is what I claim:

Average conversion of A = integral(r(t) * cA(t)) / integral(r(t))

Average conversion of B = integral(r(t) * cB(t)) / integral(r(t))

Therefore if at all points cA(t) < cB(t) then the difference between their conversion rates is:

integral(r(t) * cB(t)) / integral(r(t)) - integral(r(t) * cA(t)) / integral(r(t)) = integral(r(t) * cB(t) - r(t) * cA(t)) / integral(r(t))

which is always positive. (It should be noted that this analysis remains the same whether we're looking for a binary convert/no convert, or whether we're looking at a more complex signal, such as amount paid. If we add the complication that people entering the test may convert to payments at one or multiple later points, the analysis becomes more complicated, but the result remains the same.)
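
The claim can also be checked numerically. In the sketch below, r(t), cA(t) and cB(t) are hypothetical curves chosen for illustration: traffic and conversion both swing widely over a 24-hour day, while B leads A by a constant half a percentage point. The integrals are approximated with a simple midpoint rule.

```python
import math

def r(t):
    """Hypothetical arrival rate over a 24-hour day (visitors per hour)."""
    return 50 + 40 * math.sin(2 * math.pi * t / 24)

def c_a(t):
    """Hypothetical conversion rate of A, fluctuating widely with time of day."""
    return 0.10 + 0.08 * math.cos(2 * math.pi * t / 24)

def c_b(t):
    """B is better than A at every instant, by a margin much smaller than the swing."""
    return c_a(t) + 0.005

def integral(f, start=0.0, end=24.0, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [start, end]."""
    width = (end - start) / steps
    return sum(f(start + (i + 0.5) * width) for i in range(steps)) * width

# Average conversion = integral(r(t) * c(t)) / integral(r(t)), same r(t) for both versions.
avg_a = integral(lambda t: r(t) * c_a(t)) / integral(r)
avg_b = integral(lambda t: r(t) * c_b(t)) / integral(r)
print(avg_a, avg_b, avg_b - avg_a)
```

Since both averages are weighted by the same r(t), the difference between them is the r(t)-weighted average of cB(t) - cA(t), which here is the constant 0.005.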

As long as there is a consistent preference between A and B, fluctuations in either or both do not alter the validity of the statistical analysis. If the preference is not consistent then, of course, A/B testing stops being valid.

> I am not saying that A/B testing is not something we should do in general. I am saying that it is a good heuristic with very few provable properties compared with logarithmic regret MAB algorithms.

The fact that you are not following this proof does not mean that the proof I am offering you is invalid. In fact A/B testing has provable properties that, in a common real-world situation, are _much_ better than current state of the art logarithmic regret MAB algorithms.

I am also claiming (without proof) that this deficiency in current MAB algorithms is fixable, at the cost of a constant factor worse regret in the ideal situation where conversion rates do not change.

cmansley 5065 days ago

Also, I agree that MAB algorithms do not consider the case of adding bandit "arms" during the trial. The problem of a dynamic number of bandit arms may not be well studied, but it is very interesting. This might be another reason for requiring a reweighting of the samples.
