by placidpanda 1524 days ago

> What I don't understand is why power would be so relevant.

Running the A/B test costs more than just building the feature and releasing it, because you have to support two variants in production. Beyond that, you also need to account for the cost of acting on the results: if control wins, what do you do? If it's a tie, what do you do? It's best to budget for the maximum possible effort, or at least the expected effort. Expecting the variant to win handily is budgeting for the minimum possible effort.

I've seen multiple businesses that always schedule around shipping an A/B test and then context switching to the next project while the results stream in. Any outcome other than shipping the variant after x weeks is a huge inconvenience that throws off multiple teams, which means all those cognitive biases start to creep in and make it comfortable to declare losing variants as wins or ties.

While it's easy to write this behavior off as yet another way that groups make irrational decisions, I think the bit of truth in there is that sometimes, the cost of running the strictest, science-iest A/B test is simply too high. Power is a key part of how you reason that out up front, so you can make a rational decision not to test, or to modify your test to make the payoff worth it. For example:

* Let's set the goal metric to something higher up the funnel, which is further from our true goal (more $) but happens much more often, so we can see the effect in 1 week instead of 2 months

* We really need to do this for [variety of strategic business reasons], so let's structure our experiment to make sure it won't cost us more than $X in a worst-case scenario and find out in a few days rather than wait 2 months
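To make the first bullet concrete, here is a rough sketch of the kind of up-front power calculation I mean. It uses the standard normal-approximation sample size formula for comparing two proportions. The baseline rates, the 10% relative lift, and the daily traffic figure are made-up numbers for illustration, and `sample_size_per_arm` and `days_to_run` are names I invented, not part of any library.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect a relative lift.

    Normal-approximation formula for a two-sided test on two proportions.
    """
    p_variant = p_control * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def days_to_run(n_per_arm, daily_visitors):
    """Days needed to fill both arms, assuming traffic is split evenly."""
    return ceil(2 * n_per_arm / daily_visitors)


# Hypothetical numbers: a rare purchase event vs. a common click event.
DAILY_VISITORS = 5_000
for name, baseline in [("purchase", 0.02), ("add-to-cart click", 0.20)]:
    n = sample_size_per_arm(baseline, relative_lift=0.10)
    print(f"{name}: {n} users per arm, ~{days_to_run(n, DAILY_VISITORS)} days")
```

Under these assumptions the more frequent event needs far fewer users for the same relative lift, which is the trade being made when you move the goal metric up the funnel. Whether that metric still tracks revenue is a separate question the math can't answer.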

> I now know that B is better than A.

You know that B outperformed A in the experiment. Checking statistical significance is like asking a trustworthy person "Are you sure?" and them saying "Yeah, I'm pretty sure". It's a percentage because it's sometimes wrong, and it doesn't account for the many real-world factors that can lead an experiment with bulletproof math behind the analysis to still take people down the wrong path.
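If "sometimes wrong" feels abstract, a quick simulation shows it. This is a minimal sketch that runs many A/A tests, where both arms share the same true conversion rate, and counts how often a two-proportion z-test still declares a significant difference. The 10% conversion rate, the sample sizes, and the function names are all assumptions I picked for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_a, conv_b, n):
    """Pooled two-proportion z-test p-value for two arms of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.10, n=1000, experiments=500,
                        alpha=0.05, seed=0):
    """Fraction of A/A experiments that come out 'significant' by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        # Both arms draw from the SAME true rate, so any "win" is noise.
        conv_a = sum(rng.random() < true_rate for _ in range(n))
        conv_b = sum(rng.random() < true_rate for _ in range(n))
        if two_sided_p_value(conv_a, conv_b, n) < alpha:
            hits += 1
    return hits / experiments


print(f"A/A tests declared significant: {false_positive_rate():.1%}")
```

With a 0.05 threshold you should expect roughly one in twenty of these identical-variant experiments to report a winner, and that is before any of the real-world problems come into play.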