by sl8r 2903 days ago
> Why would that be wrong?

The issue is that modeling B with a distro centered around 2.5% ignores what we know about the historical conversion rate (2.0%) and the control bucket's conversion rate (also 2.0%). If our goal is to make the best estimate for the future that we can, we should take this data into account when evaluating B. As a thought experiment, imagine that you have A at 2.0% and B at 2.5% conversion for Week 1, with a historical conversion rate of 2.0%. Someone says they'll pay you $100 if you correctly guess what B's conversion rate will be next week, either (i) in the range 2.0% to 2.5%, or (ii) in the range 2.5% to 3.0%. I'd rather bet on (i) than on (ii).

> What would a Bayesian conclude instead?

One simple approach would just be to start with a more informative prior, like Beta(2+1,100-2+1) instead of Beta(1,1). This would pull bucket B's posterior distribution closer to 2.0%. Another approach is to use a hierarchical model [1], which will fit the individual buckets' priors for you.
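
For reference, the update behind the "more informative prior" suggestion can be written out as below. The notation is an assumption for illustration, not from the thread: alpha and beta are the prior parameters, and s is the number of conversions out of n visitors in bucket B. The mode formula holds when both posterior parameters are greater than 1.

```latex
p \sim \mathrm{Beta}(\alpha, \beta), \qquad s \mid p \sim \mathrm{Binomial}(n, p)
\quad\Longrightarrow\quad
p \mid s \sim \mathrm{Beta}(\alpha + s,\ \beta + n - s)

\operatorname{mode}(p \mid s) = \frac{\alpha + s - 1}{\alpha + \beta + n - 2}
```

With Beta(1,1) the mode is just s/n, the observed rate. With Beta(2+1,100-2+1) the prior alone has its mode at 2/100 = 2.0%, and the posterior mode becomes (s + 2)/(n + 100), i.e. the observed rate pulled toward 2.0% as if 100 extra visitors with 2 conversions had been seen.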

[1] Here's something I wrote on this a couple years ago, more focused on solving multiple comparisons problems but with the same proposed solution: http://normal-extensions.com/2014/07/16/ab-testing-hierarchi...


> The issue is that modeling B with a distro centered around 2.5% ignores what we know about the historical conversion rate (2.0%) and the control bucket's conversion rate (also 2.0%).

Both the historical and the control bucket used version A of the website, and they are consistent in their 2.0% conversion rate. Version B is different, and it appears to have a different conversion rate of 2.5%. So why should it not have a future conversion rate close to 2.5%?

Let's replace the website with a 6-sided die. Historically, the probability of throwing a 3 was 1/6. Now you replace your die with a different die and throw it 10,000 times; the 3 comes up 2560 times. If I had to guess how many times the 3 comes up the next 10,000 throws, I certainly would bet that it's closer to 2560 times than to 1667 times.

> Someone says they'll pay you $100 if you correctly guess what B's conversion rate will be next week, either (i) in the range 2.0% to 2.5%, or (ii) in the range 2.5% to 3.0%.

Case A: The historical version A of the online shop had some influence on the conversion rate during the testing of version B, drawing the conversion rate of B down. This influence will fade away in the future, so B's conversion rate will be closer to [2.5%, 3.0%] than to [2.0%, 2.5%].

Case B: The historical version A of the online shop did not have any influence on the conversion rate during the testing of version B (compare the dice example above). Then both ranges are equally plausible. But "[2.0%, 2.5%] vs [2.5%, 3.0%]" is a bad dichotomy. A more relevant one would be "[1.75%, 2.25%] vs [2.25%, 2.75%]". In that case, I would bet on [2.25%, 2.75%].

Late to the party, but:

> Both the historical and the control bucket used version A of the website, and they are consistent in their 2.0% conversion rate. Version B is different, and it appears to have a different conversion rate of 2.5%. So why should it not have a future conversion rate close to 2.5%?

It's all a matter of degree. You'd model B's rate as closer to 2.5%, but probably not centered around 2.5%. As you observe more data, the prior becomes less important. E.g., with 10k samples as in the original example, if you used Beta(2+1,100-2+1) as your prior, your posterior would be Beta(252+1, 10100-252+1), which is centered (has its mode) at 2.495%. But if you only had 1000 samples (and 25 conversions), you'd get a distro centered at 2.45%. And if you only had 200 samples (and 5 conversions), you'd get a distro centered at 2.33%. Etc.
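
The three figures above can be reproduced with a few lines of Python. This is only a sketch: the function name `posterior_mode` is made up for illustration, "centered" is read as the posterior mode, and the 10k case assumes 250 conversions (2.5% of 10,000), which is inferred from the 252 in the posterior rather than stated outright.

```python
def posterior_mode(prior_a, prior_b, conversions, samples):
    """Mode of the Beta posterior after a Beta(prior_a, prior_b) prior
    is updated with `conversions` successes out of `samples` trials."""
    a = prior_a + conversions
    b = prior_b + (samples - conversions)
    return (a - 1) / (a + b - 2)

# Informative prior from the comment: Beta(2+1, 100-2+1) = Beta(3, 99)
PRIOR_A, PRIOR_B = 2 + 1, 100 - 2 + 1

results = {}
for samples, conversions in [(10_000, 250), (1_000, 25), (200, 5)]:
    results[samples] = posterior_mode(PRIOR_A, PRIOR_B, conversions, samples)
    print(f"n={samples:>6}: posterior mode = {results[samples]:.3%}")

# n= 10000: posterior mode = 2.495%
# n=  1000: posterior mode = 2.455%
# n=   200: posterior mode = 2.333%
```

The same observed rate (2.5%) gets pulled further toward 2.0% as the sample shrinks, because the prior acts like a fixed 100 extra visitors.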

> Let's replace the website with a 6-sided die. Historically, the probability of throwing a 3 was 1/6. Now you replace your die with a different die and throw it 10,000 times; the 3 comes up 2560 times. If I had to guess how many times the 3 comes up the next 10,000 throws, I certainly would bet that it's closer to 2560 times than to 1667 times.

If you believe any weighting of the die's faces is equally likely, this would be true, so a flat prior may be an appropriate model for the die. But in the case of the website, I don't think all conversion rates are equally likely, even for a new, untested site. If the historical conversion rate is 2.0%, and I'm forced to bet on the most likely conversion for a new (never before seen) variant B, I'd much rather bet on a number near 2.0% than a number like 99%.

> Case B: The historical version A of the online shop did not have any influence on the conversion rate during the testing of version B (compare the dice example above). Then both ranges are equally plausible.

This is exactly what I'm claiming is not true. It's not that A influences B, it's that A tells you something about the likely range of A and B (in this specific case of an e-commerce site). (The reason I chose the ranges [2.0%, 2.5%] vs [2.5%, 3.0%] is that if you model B independently, you'd be indifferent between these ranges; but if you use A to inform a prior, you'd prefer [2.0%, 2.5%].)
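
To put rough numbers on the bet, here is a sketch that compares the posterior mass in the two ranges under a flat Beta(1,1) prior and under the Beta(2+1,100-2+1) prior mentioned earlier. The Week 1 data (1,000 visitors, 25 conversions) and all function names are assumptions for illustration. It uses only the standard library and integrates the Beta density with Simpson's rule.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x))

def band_probability(a, b, lo, hi, steps=2000):
    """P(lo <= p <= hi) under Beta(a, b) via Simpson's rule (steps must be even)."""
    h = (hi - lo) / steps
    total = beta_pdf(lo, a, b) + beta_pdf(hi, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(lo + i * h, a, b)
    return total * h / 3

def band_ratio(prior_a, prior_b, conversions, samples):
    """Posterior mass in [2.0%, 2.5%] divided by the mass in [2.5%, 3.0%]."""
    a = prior_a + conversions
    b = prior_b + samples - conversions
    lower = band_probability(a, b, 0.020, 0.025)
    upper = band_probability(a, b, 0.025, 0.030)
    return lower / upper

# Hypothetical Week 1 data for bucket B: 25 conversions out of 1,000 visitors.
ratio_flat = band_ratio(1, 1, 25, 1000)         # B modeled independently
ratio_informative = band_ratio(3, 99, 25, 1000) # prior informed by the 2.0% history

print(f"flat prior:        P(i)/P(ii) = {ratio_flat:.3f}")
print(f"informative prior: P(i)/P(ii) = {ratio_informative:.3f}")
```

The informative prior always moves the ratio toward range (i), since its density falls off above 2.0%. Whether the ratio ends up above 1 depends on how strong the prior is relative to the amount of data, so treat the output as showing the direction of the shift rather than settling the bet.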