by btilly
5020 days ago
First let me mention that you should not try to use the Student's t-test. One of the first assumptions of the Student's t-test is that each of the two populations being compared should follow a normal distribution. In A/B tests this assumption is almost never true, and therefore the Student's t-test is an inappropriate statistical test.

OK, now to what I said about Poisson distributions. Assuming that people arrive according to a Poisson distribution allows us to conclude 2 key facts:

1. The statistics will behave exactly as they would if each person arrived IID from an infinite population.
2. Simpson's paradox will not apply to the theoretical distribution of the samples for A and B.

Assuming #1 without #2 does not get you very far. But having facts #1 and #2 allows us to use statistics.

I have no idea why you would speculate that I am attempting to weight recent samples higher and downweight earlier samples. All samples are, in fact, weighted exactly the same. This fact notwithstanding, different times of arrival are not weighted the same. That is because the sample rate fluctuates over time depending on factors such as traffic levels on your webserver. But it fluctuates in an identical way for the two versions. (This fact is critical in being able to conclude point #2.)

Does this help?
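To make point #2 concrete, here is a rough sketch in Python of why it matters that the sample rate fluctuates identically for the two versions. Everything in it is a made-up illustration, not data from any real test: the two traffic periods, the conversion rates, and the helper name `pooled_rates` are all assumptions. It computes expected pooled conversion rates exactly, first with the same A/B split in every period, then with a split that differs by period.

```python
from fractions import Fraction

# Hypothetical scenario: B converts better than A in BOTH periods.
periods = [
    {"visitors": 1000, "rate_a": Fraction(10, 100), "rate_b": Fraction(12, 100)},  # quiet period
    {"visitors": 9000, "rate_a": Fraction(2, 100), "rate_b": Fraction(3, 100)},    # busy period
]


def pooled_rates(periods, share_a):
    """Return the expected pooled conversion rates (A, B).

    share_a[i] is the fraction of period i's visitors assigned to A;
    the remainder go to B.
    """
    n_a = n_b = conv_a = conv_b = Fraction(0)
    for period, share in zip(periods, share_a):
        visitors_a = period["visitors"] * share
        visitors_b = period["visitors"] * (1 - share)
        n_a += visitors_a
        n_b += visitors_b
        conv_a += visitors_a * period["rate_a"]
        conv_b += visitors_b * period["rate_b"]
    return conv_a / n_a, conv_b / n_b


# Same split in every period: traffic fluctuates identically for A and B.
equal_a, equal_b = pooled_rates(periods, [Fraction(1, 2), Fraction(1, 2)])

# Split differs by period: A is over-sampled when conversion is high.
skew_a, skew_b = pooled_rates(periods, [Fraction(9, 10), Fraction(1, 10)])

print("equal split:  A =", float(equal_a), " B =", float(equal_b))
print("skewed split: A =", float(skew_a), " B =", float(skew_b))
```

With the equal split the pooled numbers rank B above A, matching every individual period. With the skewed split the pooled numbers rank A above B even though B wins in each period, which is exactly Simpson's paradox. Note that this only looks at expected values; it says nothing about sampling noise.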
|
Second, I believe I am getting hung up on the fact that different arrival times are "weighted" differently. I think you are claiming that the Poisson assumption gives us equal numbers of A and B trials, so we can combine the statistics (counts) and avoid Simpson's paradox. This is fine, but why would you say "different times of arrival are not weighted the same"? Does this mean you are somehow weighting periods of heavy traffic down and weighting low traffic up?
So, what happens when trial A becomes less favorable over time or is less favorable for brief periods? This means that the underlying random variable's mean is changing over time. Most statistical bounds cannot handle this situation.
I am not saying that A/B testing is not something we should do in general. I am saying that it is a good heuristic with very few provable properties compared with logarithmic-regret multi-armed bandit (MAB) algorithms.