by Silhouette 3292 days ago
> It would be great to see the max uplift achieved for each category

Indeed. What matters most with these kinds of experiments isn't really the average results, but what is possible and the distribution among beneficial results only. After all, the whole point of A/B testing is to try experiments and then either keep the changes if they improve results or stay with what you've already got if the changes didn't bring an improvement. Surely all the treatments that led to negative changes would just have been discarded in practice? It's still important to see the full picture as well, if only to guide decisions about which experiments are even worth trying, but I think there's another side that doesn't fully come through here.


I think the big error in A/B testing is that expectations are quite often very unrealistic. Designers typically have a reasonably good idea about what will work and what will not. Finding 'million dollar buttons' is rare. Of course, an improvement of a couple of percent, or even tens of percent, is nothing to sneeze at. But thinking that by A/B testing forever you're going to make a shrub grow into a tree is imo not realistic, quite apart from the fact that a continuously changing user interface is often in itself a barrier to sales.

Ironically, the companies that have benefited most from A/B testing were the ones that were doing a terrible job of it in the first place, so there was lots of low-hanging fruit to make the consultants look good.

Yet another item often missed: A/B testing success is a direct function of the length of the lever you are pulling. If that lever commands billions of dollars then it is easy to make the testing pay for itself. But if you're trying to turn $10,000 into $11,500 then you are likely wasting your time.
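To put rough numbers on the lever argument, here is a minimal sketch. The $10,000 to $11,500 case (a 15% uplift) is the one from the comment; the fixed testing cost of $5,000 and the billion-dollar comparison figure are purely hypothetical assumptions, and `lever_payoff` is a made-up helper name.

```python
def lever_payoff(revenue_at_stake, relative_uplift, cost_of_testing):
    """Net gain from a winning test: extra revenue minus what the testing cost."""
    extra_revenue = revenue_at_stake * relative_uplift
    return extra_revenue - cost_of_testing


# Hypothetical fixed cost of designing, running and analysing the tests.
COST_OF_TESTING = 5_000.0

# Short lever: turning $10,000 into $11,500 is a 15% uplift.
small = lever_payoff(10_000.0, 0.15, COST_OF_TESTING)

# Long lever: the same 15% uplift applied to an assumed $1 billion.
large = lever_payoff(1_000_000_000.0, 0.15, COST_OF_TESTING)

print(f"short lever net payoff: {small:,.0f}")
print(f"long lever net payoff:  {large:,.0f}")
```

With the same relative uplift and the same assumed cost, the short lever comes out negative and the long one hugely positive; only the size of the lever changed.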

> isn't really the average results, but what is possible and the distribution among beneficial results only

No, the bad results also matter: you are still spending visitors and revenue on testing bad variants, and that has to be counted when working out the costs and benefits. Even with a bandit approach, you incur regret that grows logarithmically with the number of visitors, and it adds up across variants. And testing a bad variant is common: the best category, 'scarcity', has a 16% probability of the variant being harmful. A Value of Information calculation has to take into account the harm done while testing.
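A very rough sketch of what such a calculation might look like. Only the 16% harm probability comes from the figures above; the effect sizes, visitor counts, revenue per visitor and the function name are all hypothetical, and it assumes (unrealistically) that the test always classifies the variant correctly.

```python
def net_value_of_test(p_harm, uplift_if_good, loss_if_bad,
                      revenue_per_visitor, test_visitors, post_test_visitors,
                      variant_share=0.5):
    """Expected net revenue change from running one A/B test and then
    keeping the variant only if it turned out to be better.

    Simplifying assumption: the test never misclassifies the variant.
    Returns (expected net value, expected harm done while testing).
    """
    exposed = test_visitors * variant_share  # visitors who see the variant
    p_good = 1.0 - p_harm

    # During the test, a good variant earns extra and a bad one loses revenue.
    gain_during_test = p_good * uplift_if_good * revenue_per_visitor * exposed
    harm_during_test = p_harm * loss_if_bad * revenue_per_visitor * exposed

    # After the test, only a good variant is kept; a bad one is discarded.
    gain_after_test = p_good * uplift_if_good * revenue_per_visitor * post_test_visitors

    net = gain_during_test - harm_during_test + gain_after_test
    return net, harm_during_test


# 16% harm probability from the thread; every other number is made up.
net, harm = net_value_of_test(
    p_harm=0.16, uplift_if_good=0.02, loss_if_bad=0.02,
    revenue_per_visitor=1.0, test_visitors=10_000, post_test_visitors=100_000,
)
print(f"expected net value: {net:.2f}, of which harm while testing: {harm:.2f}")
```

The harm term never disappears just because bad variants are discarded afterwards; it shrinks relative to the total only when the post-test horizon is long compared with the test itself.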

(Hence the final sentence of my previous comment.)