by cmansley
5018 days ago
First, I was using Student's t-test as a stand-in for whatever test or statistical measure you would like to use. I believe the popular one in the bandit literature is Hoeffding's inequality, hence the log term in the MAB algorithms. I agree this was a poor choice of example. Second, I believe I am getting hung up on the fact that different arrival times are "weighted" differently. I think you are claiming that the Poisson assumption gives us equal numbers of A and B trials, so we can combine the statistics (counts) and avoid Simpson's paradox. This is fine, but why would you say "different times of arrival are not weighted the same"? Does this mean you are somehow weighting periods of heavy traffic down and weighting low traffic up? So, what happens when trial A becomes less favorable over time or is less favorable for brief periods? This means that the underlying random variable's mean is changing over time. Most statistical bounds cannot handle this situation. I am not saying that A/B testing is not something we should do in general. I am saying that it is a good heuristic with very few provable properties compared with logarithmic regret MAB algorithms.
You keep on reversing the exact point that I keep on making, and then fail to understand what I said. So I guess that I'll keep repeating the same point in different ways and hope that at some point you'll get it.
Why do I say reversing? Because the weight a time period gets is directly proportional to the expected traffic. Therefore each observation is weighted the same, and periods of heavy traffic are the ones that are weighted most heavily.
Anyways, let's suppose, for the sake of argument, that from 2 AM to 3 AM observations arrive at an average rate of 1 every 10 minutes. Suppose that from 8 AM to 9 AM they arrive at an average rate of 1 per minute.
Then, on average, we expect to have 6 observations from the hour in the middle of the night, and 60 observations from the hour from 8 AM to 9 AM.
Thus when we calculate average returns across the entire interval, on average we'll have 10x as many observations from 8 AM to 9 AM. Therefore on average the latter time period will have 10x the impact on the final results.
The conclusion is that different time periods are naturally weighted differently. However the weighting is the same across the two different versions.
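To make the arithmetic concrete, here is a minimal sketch of the two-hour example. The expected observation counts come from the rates above; the per-period conversion rates are made-up numbers chosen only for illustration.

```python
# Expected observations per hour, from the example above:
# 1 every 10 minutes at night, 1 per minute in the morning.
expected_obs = {"2AM-3AM": 60 / 10, "8AM-9AM": 60 / 1}

# Hypothetical per-period conversion rates (assumed, not from the discussion).
conv_a = {"2AM-3AM": 0.02, "8AM-9AM": 0.05}
conv_b = {"2AM-3AM": 0.03, "8AM-9AM": 0.06}

def pooled_rate(conv, obs):
    """Expected overall conversion rate when every observation counts equally."""
    total = sum(obs.values())
    return sum(obs[period] * conv[period] for period in obs) / total

# Share of the final result contributed by each period.
total_obs = sum(expected_obs.values())
weights = {period: n / total_obs for period, n in expected_obs.items()}

rate_a = pooled_rate(conv_a, expected_obs)
rate_b = pooled_rate(conv_b, expected_obs)

print(weights)
print(rate_a, rate_b)
```

Both versions are pooled with the same `weights`, so the morning hour dominates each total equally and the comparison between A and B is not skewed by the traffic pattern.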
If you want to get more mathematical about it, suppose that r(t) is the average rate at which observations are arriving in our subgroups. (So r(t) is the same for versions A and B.) Suppose that cA(t) is the rate at which version A converts, and suppose that cB(t) is the rate at which version B converts.
Here is what I claim:
Average conversion of A = integral(r(t) * cA(t)) / integral(r(t))
Average conversion of B = integral(r(t) * cB(t)) / integral(r(t))
Therefore if at all points cA(t) < cB(t) then the difference between their conversion rates is:
integral(r(t) * cB(t)) / integral(r(t)) - integral(r(t) * cA(t)) / integral(r(t)) = integral(r(t) * cB(t) - r(t) * cA(t)) / integral(r(t))
which is always positive, because r(t) is never negative, cB(t) - cA(t) is positive at every t, and there is some traffic (r(t) > 0 for at least part of the interval). (It should be noted that this analysis remains the same whether we're looking for a binary convert/no convert, or whether we're looking at a more complex signal, such as amount paid. If we add the complication that people entering the test may convert to payments at one or multiple later points, the analysis becomes more complicated, but the result remains the same.)
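The claim can be checked numerically. The sketch below evaluates the two integrals with a midpoint Riemann sum. The specific shapes of `r`, `c_a` and `c_b` are assumptions invented for this example; the only properties that matter are that `r` is shared by both versions and that `c_b(t) > c_a(t)` at every `t`.

```python
import math

def r(t):
    # Hypothetical arrival rate (observations per hour) over a 24-hour day:
    # quiet at midnight, busiest at noon. Shared by versions A and B.
    return 5.0 + 55.0 * math.sin(math.pi * t / 24.0) ** 2

def c_a(t):
    # Hypothetical conversion rate of A; it drifts during the day.
    return 0.04 + 0.02 * math.cos(2.0 * math.pi * t / 24.0)

def c_b(t):
    # Hypothetical conversion rate of B: above A at every t, by a gap
    # that varies between 0.002 and 0.006.
    return c_a(t) + 0.002 + 0.004 * math.sin(math.pi * t / 24.0) ** 2

def integral(f, t0=0.0, t1=24.0, steps=10_000):
    # Midpoint Riemann sum of f over [t0, t1].
    h = (t1 - t0) / steps
    return h * sum(f(t0 + (i + 0.5) * h) for i in range(steps))

def average_conversion(c):
    # integral(r(t) * c(t)) / integral(r(t))
    return integral(lambda t: r(t) * c(t)) / integral(r)

avg_a = average_conversion(c_a)
avg_b = average_conversion(c_b)
print(avg_a, avg_b, avg_b - avg_a)
```

Even though both conversion rates swing widely over the day, the traffic-weighted difference stays positive, and it lies between the smallest and largest pointwise gap.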
"So, what happens when trial A becomes less favorable over time or is less favorable for brief periods? This means that the underlying random variable's mean is changing over time. Most statistical bounds cannot handle this situation."
As long as there is a consistent preference between A and B, fluctuations in either or both do not alter the validity of the statistical analysis. If the preference is not consistent then, of course, A/B testing stops being valid.
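One way to read the distinction is sketched below, with made-up conversion rates and traffic mixes. When B beats A in every period, the pooled winner is B no matter how traffic is distributed. When the preference flips between periods, the pooled winner depends on the traffic mix, so the test no longer answers the question "which version is better?".

```python
def pooled_rate(conv, traffic):
    """Traffic-weighted conversion rate across periods."""
    total = sum(traffic.values())
    return sum(traffic[p] * conv[p] for p in traffic) / total

def winner(conv_a, conv_b, traffic):
    """Name of the version with the higher pooled conversion rate."""
    if pooled_rate(conv_a, traffic) > pooled_rate(conv_b, traffic):
        return "A"
    return "B"

# Two hypothetical traffic mixes (expected observations per period).
morning_heavy = {"night": 6, "morning": 60}
night_heavy = {"night": 60, "morning": 6}

# Consistent preference: B converts better in both periods (assumed numbers).
consistent_a = {"night": 0.02, "morning": 0.05}
consistent_b = {"night": 0.03, "morning": 0.06}

# Inconsistent preference: A is better at night, B in the morning (assumed numbers).
mixed_a = {"night": 0.06, "morning": 0.04}
mixed_b = {"night": 0.03, "morning": 0.05}

for label, a, b in [("consistent", consistent_a, consistent_b),
                    ("inconsistent", mixed_a, mixed_b)]:
    print(label,
          winner(a, b, morning_heavy),
          winner(a, b, night_heavy))
```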
"I am not saying that A/B testing is not something we should do in general. I am saying that it is a good heuristic with very few provable properties compared with logarithmic regret MAB algorithms."
The fact that you are not following this proof does not mean that the proof I am offering you is invalid. In fact, A/B testing has provable properties that, in a common real-world situation, are _much_ better than those of current state-of-the-art logarithmic regret MAB algorithms.
I am also claiming (without proof) that this deficiency in current MAB algorithms is fixable, at the cost of a constant factor worse regret in the ideal situation where conversion rates do not change.