Hacker News new | ask | show | jobs
by cmansley 5018 days ago
I don't know how we got onto arrival rates of our visitors. I would just like to state that Simpson's paradox is exactly why we shouldn't compare percent conversions. They are meaningless. However, many of the statistical tests like Student's t-test compensate for this paradox by including the number of samples in the tests. See : http://en.wikipedia.org/wiki/Student%27s_t-test#Unequal_samp...

I think you said one thing that are at the heart of the issue. We assume that the E[conversion of B] > E[conversion of A] for the entire period sampled.

I think all of the details about Poisson processes are not required if you just assume that each person is drawn IID from the population.

I just don't think you are answering the right questions here.

Let's assume that each person arrives IID from the infinite population. Then, we have a Bernoulli process for each A or B query. A "conversion" results in a 1, a failure results in a 0. Since, these people are arriving IID, we can select a sub-sample which is also IID. We would now like to estimate the parameters for each process and/or compare the two processes. We can do this using t-test. This will give us the statistical significance that the one group had a higher "conversion rate" than the other group. Note: rate does not factor into this problem at all because we assume the participants were IID, so the t-test (used correctly accounting for different number of samples) will tell us which test is larger.

My question is now what happens when the parameters of your queries for A or B change over time. Still under the assumption that E[B] > E[A], it now matters greatly in which order you use your samples.

I think the only reason you brought the Poisson model into the discussion is to weight the more recent samples higher and down weight the earlier samples in your basket of samples. This is a heuristic for considering a fixed interval in which the samples are stationary. It effectively considers a window that slides with the time of arrival.

1 comments

First let me mention that you should not try to use the Student's t-test. One of the first assumptions of the Student's t-test is that Each of the two populations being compared should follow a normal distribution. In A/B tests this assumption is almost never true, and therefore the Student's t-test is an inappropriate statistical assumption.

OK, now to what I said about Poisson distributions. Assuming that people arrive on a Poisson distribution allows us to conclude 2 key facts:

1. The statistics will behave exactly like it would if each person arrives IID from an infinite population.

2. Simpson's paradox will not apply to the theoretical distribution of the samples for A and B.

Assuming #1 without #2 does not get you very far. But having facts #1 and #2 allows us to use statistics.

I have no idea why you would speculate that I am attempting to weight recent samples higher and downweight earlier samples. All samples are, in fact, weighted exactly the same. This fact notwithstanding, different times of arrival are not weighted the same. That is because the sample rate fluctuates over time depending on factors such as traffic levels on your webserver. But it fluctuates in an identical way for the two versions. (This fact is critical in being able to conclude point #2.)

Does this help?

First, I was using Student t-test as a stand-in for whatever test or statistical measure you would like to use. I believe the popular one is Hoeffding's inequality in the bandit literature, hence the log term in the MAB algorithms. I agree this was a poor choice of example.

Second, I believe I am getting hung up on the fact that different arrival times are "weighted" differently. I think you are claiming that the Poisson assumption gives us equal numbers of A and B trials, so we can combined the statistics (counts) and avoid Simpson's paradox. This is fine, but why would you say "different times of arrival are not weighted the same". Does this mean you are somehow weighting periods of heavy traffic down and weighting low traffic up?

So, what happens when trial A becomes less favorable over time or is less favorable for brief periods? This means that the underlying random variable's mean is changing over time. Most statistical bounds cannot handle this situation.

I am not saying that A/B testing is not something we should do in general. I am saying that it is a good heuristic with very few provable properties compared with logarithmic regret MAB algorithms.

This is fine, but why would you say "different times of arrival are not weighted the same". Does this mean you are somehow weighting periods of heavy traffic down and weighting low traffic up?

You keep on reversing the exact point that I keep on making, and then fail to understand what I said. So I guess that I'll keep repeating the same point in different ways and hope that at some point you'll get it.

Why do I say reversing? Because the weight a time period gets is directly proportional to the expected traffic. Therefore each observation is weighted the same, and periods of heavy traffic are the ones that are weighted most heavily.

Anyways, let's suppose, for the sake of argument, that from 2 AM to 3 AM observations arrive at an average rate of 1 every 10 minutes. Suppose that from 8 AM to 9 AM that they arrive at an average rate of one per minute.

Then, on average, we expect to have 6 observations from the hour in the middle of the night, and 60 observations from the hour from 8 AM to 9 AM.

Thus when we calculate average returns across the entire interval, on average we'll have 10x as many observations from 8 AM to 9 AM. Therefore on average the latter time period will have 10x the impact on the final results.

The conclusion is that different time periods are naturally weighted differently. However the weighting is the same across the two different versions.

If you want to get more mathematical about it, suppose that r(t) is the average rate at which observations are arriving in our subgroups. (So r(t) is the same for versions A and B.) Suppose that cA(t) is the rate at which version A converts, and suppose that cB(t) is the rate at which version B converts.

Here is what I claim:

Average conversion of A = integral(r(t) * cA(t)) / integral(r(t))

Average conversion of B = integral(r(t) * cB(t)) / integral(r(t))

Therefore if at all points cA(t) < cB(t) then the difference between their conversion rates is:

integral(r(t) * cB(t)) / integral(r(t)) - integral(r(t) * cA(t)) / integral(r(t)) = integral(r(t) * cB(t) - r(t) * cA(t)) / integral(r(t))

which is always positive. (It should be noted that this analysis remains the same whether we're looking for a binary convert/no convert, or whether we're looking at a more complex signal, such as amount paid. If we add the complication that people entering the test may convert to payments at one or multiple later points, the analysis becomes more complicated, but the result remains the same.)

So, what happens when trial A becomes less favorable over time or is less favorable for brief periods? This means that the underlying random variable's mean is changing over time. Most statistical bounds cannot handle this situation.

As long as there is a consistent preference between A and B, fluctuations in either or both do not alter the validity of the statistical analysis. If the preference is not consistent then, of course, A/B testing stops being valid.

I am not saying that A/B testing is not something we should do in general. I am saying that it is a good heuristic with very few provable properties compared with logarithmic regret MAB algorithms.

The fact that you are not following this proof does not mean that the proof I am offering you is invalid. In fact A/B testing has provable properties that, in a common real-world situation, are _much_ better than current state of the art logarithmic regret MAB algorithms.

I am also claiming (without proof) that this deficiency in current MAB algorithms is fixable, at the cost of a constant factor worse regret in the ideal situation where conversion rates do not change.

Ok, you are making a point about sampling. In periods of high traffic you have more samples that will bias the calculation. But, since the number of samples will be consistent between A and B, everything is fine. Fine.

I believe further discussion will not be productive. But, I suggest if you have a proof about A/B testing as it relates to the MAB problem or even if you have a proof about A/B testing in general that drops the normally distributed assumption or tells you exactly when to switch from exploring to exploiting, I suggest you write it up and publish it on arXv.

Thanks for your time.

Yes, I was indeed making a point about sampling. Hopefully we're on the same page now.

But here is an honest question for you. Why do you think that it would make sense for me to try to write up and publish a paper on arXv?

My view is that doing so would take a considerable amount of work. And the real-world constraints that matter to my clients don't seem to be in a direction that academics care much about, so I can't see them getting particularly interested in it. So it does not seem like it positively impacts my life.

I say this as someone who has several publications to my name. This fact has only once mattered to me. That once was when I needed to get sign-off from my current employer to have something I did before they hired me get published. Unfortunately my employer at the moment was eBay, they had just purchased Skype, and the paper that I was publishing among other things implied that Skype was unlikely to be worth what eBay had paid for it. This was..not fun.

If some research mathematician particularly wanted to sit down, pick my brains, and try to formalize the real-world constraints I've observed in my clients, that would be fine by me. It would only be fair in that case for me to be listed as a co-author. But unless that happens, I'm not going to try to publish a formal paper.

I was encouraging you to write down your mathematical evidence of better performing algorithms, because I know from my experience that when I try to write down proofs, my assumptions and reasoning become clearer. And often a step in the proof or logic that seems trivial becomes less trivial once you actually try to lay out the proof.

I believe that one of two things will arise when you go to write up a mathematical proof. One, you will be dependent on a statistical test that makes a normal distribution assumption implicitly or explicitly. Two, you will dependent on a bound that only applies in the limit. There is a third possibility, which is the most interesting to me, which is that you are dependent on a period of stationarity in your data, in other words, the distribution you are measuring must be stationary.

The proofs in the MAB papers are not intentionally obtuse or ignoring mainstream statistical ideas. They are written that way to say very explicit things under well defined assumptions. Locking down assumptions and being very precise is what writing down a mathematical proof is all about.