| HN Mirror



	by joe-stanton 3292 days ago
	This is a really useful article. It's a shame that so much development time is wasted on large numbers of fruitless optimisations just because they are "easy" (e.g. tweaking the colour of a CTA). That said, I'm surprised many of the results are so negative. It would be great to also see the max uplift achieved for each category. A number of retailers I've worked with have been able to beat these uplifts by quite a bit. I wonder if the results might be significantly skewed by the kind of clients Qubit has?


sweezyjeezy 3292 days ago

Two things. Firstly, each of the scores you see in the key findings is just the average. We have also estimated the size of the standard deviation (see the table in section 2, or appendix A). So for some treatments, large uplifts are not out of the question.

Maybe more importantly: every A/B test ever run suffers from measurement error, and in e-commerce this error is usually on the same scale as the effect you are trying to measure. This means that sometimes you will 'see' massive uplifts where, in actuality, most of the apparent effect was due to random noise. This is kind of the curse of e-commerce: most people have enough data to say something (we are 95% sure this test was positive), but most cannot say it with any notable precision (we are 95% sure the uplift was between +8% and +9%). Basically all the stats in this analysis are trying to remove this noise, and this is what we got.
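To make the noise point concrete, here is a minimal simulation sketch. It is not taken from the article: the true uplift, the standard error, and the significance rule are all assumed values chosen for illustration. It runs the same hypothetical treatment many times and compares the average measured uplift across all tests with the average among tests that look like significant wins.

```python
import random
import statistics


def simulate_tests(true_uplift=0.02, standard_error=0.03, n_tests=10_000, seed=0):
    """Simulate repeated A/B tests of one treatment.

    Each measured uplift is the true uplift plus Gaussian noise. All
    parameter values are illustrative assumptions, not figures from
    the article.
    """
    rng = random.Random(seed)
    observed = [rng.gauss(true_uplift, standard_error) for _ in range(n_tests)]
    # Call a test a "significant win" if its estimate is more than
    # 1.96 standard errors above zero.
    threshold = 1.96 * standard_error
    winners = [u for u in observed if u > threshold]
    return observed, winners


observed, winners = simulate_tests()
print(f"mean measured uplift, all tests:     {statistics.mean(observed):.4f}")
print(f"mean measured uplift, 'winners' only: {statistics.mean(winners):.4f}")
print(f"share of tests declared winners:      {len(winners) / len(observed):.3f}")
```

Under these assumptions every declared winner reports an uplift above 1.96 standard errors (0.0588 here), so the winners' average is necessarily well above the true 0.02, even though the average over all tests stays close to it.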


gwern 3292 days ago

Great to see a multilevel model used to shrink the effects. I was reading the abstract and thought, they probably didn't correct for sampling error - but you did.

I'm not an expert on plate notation, though, so I'm not sure which MLM you used. Is it basically `Revenue ~ (Covariate_1 + ... + Covariate_n | Treatment | Category)`?
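For readers unfamiliar with shrinkage, the idea can be sketched with the simplest normal-normal case. This is only an illustration of partial pooling, not the model used in the paper: the function name `shrink` and all the numbers are hypothetical, and the category-level mean and spread are taken as given rather than estimated jointly.

```python
import math


def shrink(observed_uplift, standard_error, prior_mean, prior_sd):
    """Normal-normal partial pooling for a single test.

    The estimate is pulled toward the category mean; the noisier the
    test (larger standard_error), the stronger the pull.
    Returns (posterior_mean, posterior_sd).
    """
    # Weight on the observed value: between-test variance as a share
    # of the total variance.
    weight = prior_sd**2 / (prior_sd**2 + standard_error**2)
    posterior_mean = prior_mean + weight * (observed_uplift - prior_mean)
    posterior_sd = math.sqrt(weight) * standard_error
    return posterior_mean, posterior_sd


# Hypothetical test: measured +10% uplift with a 5% standard error, in a
# category whose tests average +1% with a 2% spread.
mean, sd = shrink(0.10, 0.05, prior_mean=0.01, prior_sd=0.02)
print(f"shrunk estimate: {mean:.4f} (sd {sd:.4f})")
```

With these made-up inputs the weight on the observation is about 0.14, so the measured +10% is shrunk to roughly +2.2%.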


Silhouette 3292 days ago

> It would be great to see the max uplift achieved for each category

Indeed. What matters most with these kinds of experiments isn't really the average results, but what is possible and the distribution among beneficial results only. After all, the whole point of A/B testing is to try experiments and then either keep the changes if they improve results or stay with what you've already got if the changes didn't bring an improvement. Surely all the treatments that led to negative changes would just have been discarded in practice? It's still important to see the full picture as well, if only to guide decisions about which experiments are even worth trying, but I think there's another side that doesn't fully come through here.


jacquesm 3292 days ago

I think the big error in A/B testing is that expectations are quite often very unrealistic. Designers typically have a reasonably good idea about what will work and what will not. Finding 'million dollar buttons' is rare. Of course an improvement of a couple of percent, or even tens of percent, is nothing to sneeze at. But thinking that by A/B testing forever you're going to make a shrub grow into a tree is imo not realistic. Aside from that, a continuously changing user interface is often in itself a barrier to sales.

Ironically, the companies that have benefited most from A/B testing were the ones that were doing a terrible job in the first place, so there was lots of low-hanging fruit, which makes the consultants look good.

Yet another item often missed: A/B testing success is a direct function of the length of the lever you are pulling. If that lever commands billions of dollars then it is easy to make it pay for itself. But if you're trying to turn $10000 into $11500 then you likely are wasting your time.


gwern 3292 days ago

> isn't really the average results, but what is possible and the distribution among beneficial results only

No, the bad results also matter: you are still spending visitors and revenue on testing out bad variants, which is part of determining the costs and benefits. Even with a bandit approach, you incur regret that grows logarithmically with the number of visitors. And testing a bad variant is common: the best category, 'scarcity', has a 16% probability of the variant being harmful. A Value of Information calculation has to take into account the harm done while testing.
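A rough sketch of the harm side of such a calculation, assuming the belief about a variant's uplift is a normal distribution. The function `harm_profile` and the 3% figures are hypothetical and not taken from the article; a mean equal to one standard deviation is chosen only because it gives a probability of harm of about 16%.

```python
from statistics import NormalDist


def harm_profile(mean_uplift, sd_uplift):
    """Probability of harm and expected downside for a normal belief.

    Returns (p_harm, expected_downside), where expected_downside is
    E[min(uplift, 0)]: the uplift averaged over all outcomes, counting
    beneficial outcomes as zero.
    """
    p_harm = NormalDist(mean_uplift, sd_uplift).cdf(0.0)
    z = mean_uplift / sd_uplift
    std = NormalDist()  # standard normal
    # Closed form for E[min(X, 0)] with X ~ Normal(mean_uplift, sd_uplift).
    expected_downside = mean_uplift * std.cdf(-z) - sd_uplift * std.pdf(z)
    return p_harm, expected_downside


# Illustrative belief: mean uplift +3%, standard deviation 3%.
p_harm, downside = harm_profile(0.03, 0.03)
print(f"probability the variant is harmful: {p_harm:.3f}")
print(f"expected downside (fraction of revenue exposed): {downside:.5f}")
```

Multiplying the expected downside by the revenue exposed to the variant during the test gives one term of the testing cost under these assumptions; a full Value of Information calculation would also weigh the expected gain from keeping a winning variant.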


Silhouette 3292 days ago

(Hence the final sentence of my previous comment.)
