by demopathos 1519 days ago
What I don't understand is why power would be so relevant. I want to know if going from A to B would increase my revenue. I run an A/B test and see statistical significance, even if a minor one. I now know that B is better than A.

I suppose the need for a power calculation comes in when considering effort. If I need 10 engineers for a month to build out a feature that won't get the power it needs for a year, it may not be worth it.

4 comments

Power is important if you want to distinguish between noise and true signals. In other words, if you care about P(real effect | significant).

From good ol' Bayes we get:

P(real|sig) = P(sig|real) x P(real) / P(sig)

P(sig|real) is the power; so if you have more power, all other things being equal (a bit of a weaselly caveat), the likelihood that your stat sig result is real is higher.
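To put rough numbers on that, here is a minimal sketch. It expands P(sig) with the law of total probability and treats the significance threshold alpha as P(sig|not real). The 10% prior and alpha = 0.05 are made-up values for illustration, not anything measured.

```python
def prob_real_given_sig(power, prior_real, alpha=0.05):
    """P(real | sig) via Bayes, assuming P(sig | not real) equals alpha."""
    # P(sig) = P(sig|real) * P(real) + P(sig|not real) * P(not real)
    p_sig = power * prior_real + alpha * (1 - prior_real)
    return power * prior_real / p_sig


# Hypothetical prior: 1 in 10 tested ideas has a real effect.
for power in (0.2, 0.5, 0.8):
    print(f"power={power:.1f} -> P(real|sig)={prob_real_given_sig(power, 0.1):.2f}")
```

Under these assumptions, a significant result from a test with 20% power is more likely noise than signal, while the same result at 80% power is real roughly two times out of three.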

> What I don't understand is why power would be so relevant.

Running the A/B test itself costs more than just building and releasing the feature, since you have to support two variants in production. Beyond that, you also need to account for the cost of acting on the results: if control wins, what do you do? If it's a tie, what do you do? It's best to budget for the maximum possible effort, or at least the expected effort; expecting the variant to win handily is budgeting for the minimum possible effort.

I've seen multiple businesses that always schedule around shipping an A/B test and context switching to the next project while the results stream in. Any result that isn't shipping the variant after x weeks is a huge inconvenience that throws off multiple teams, which means all those cognitive biases start to creep in and make it comfortable to declare loser variants as wins or ties.

While it's easy to write this behavior off as yet another way that groups make irrational decisions, I think the bit of truth in there is that sometimes, the cost of running the strictest, science-iest A/B test is simply too high. Power is a key part of how you reason that out up front, so you can make a rational decision not to test, or to modify your test to make the payoff worth it. For example:

* Let's set the goal metric to something higher up the funnel, which is further from our true goal (more $) but happens much more often, so we can see the effect in 1 week instead of 2 months

* We really need to do this for [variety of strategic business reasons], so let's structure our experiment to make sure it won't cost us more than $X in a worst case scenario and find out in a few days rather than wait 2 months
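The first trade-off above can be roughed out with the textbook sample-size approximation for a two-sided, two-proportion z-test. This is only a sketch: the conversion rates are invented, and `samples_per_arm` is a hypothetical helper, not anyone's production tooling.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect p_control -> p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_control - p_variant) ** 2)


# Made-up rates: the same 10% relative lift on a rare and a common event.
purchase = samples_per_arm(0.02, 0.022)  # rare, close to revenue
click = samples_per_arm(0.20, 0.22)      # common, higher up the funnel
print(f"purchase metric: {purchase} users per arm")
print(f"click metric:    {click} users per arm")
```

With these invented rates the rare metric needs more than ten times the traffic of the common one, which is the kind of gap that turns a one-week test into a multi-month one.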

> I now know that B is better than A.

You know that B outperformed A in the experiment. Checking statistical significance is like asking a trustworthy person "Are you sure?" and hearing "Yeah, I'm pretty sure". It's a percentage because it's sometimes wrong, and it doesn't account for the many real-world factors that can lead an experiment with bulletproof math behind the analysis to take people down the wrong path.

Not just effort: there usually are many costs associated with change. Perhaps user disorientation, downtime, as well as a lack of long term understanding of how the change plays out in combination with other factors.

A small improvement may not be worth it right now.

Perhaps it can be deferred. Perhaps it makes sense to bundle it with other changes later. *

* Some changes might reflect better on a product when they are rolled out together; e.g. to signal a major release.

> I run an A/B test and see statistical significance

That’s exactly what power gives you: a fighting chance to detect anything. Not sure where the misunderstanding comes from – it’s about the worthwhileness of tests themselves, not “features”.

You can think about power visually by relating it to confidence intervals: higher power → more precise (expected) estimate. Low power → bands so wide that you might as well use an RNG.
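A quick sketch of that picture, using the normal approximation for the 95% interval around a difference of two conversion rates. The 10% baseline rate and the sample sizes are assumptions chosen only to show the scaling.

```python
import math


def ci_half_width(rate, n_per_arm, z=1.96):
    """Approximate 95% CI half-width for the difference of two proportions,
    assuming both arms convert at roughly `rate` with n_per_arm users each."""
    return z * math.sqrt(2 * rate * (1 - rate) / n_per_arm)


# Hypothetical baseline of 10% conversion at three sample sizes.
for n in (500, 5_000, 50_000):
    print(f"n={n:>6} per arm -> difference known to within +/- {ci_half_width(0.10, n):.4f}")
```

If the lift you care about is one percentage point, the smallest sample here gives a band several times wider than the effect, so the test can't tell a win from a loss. The band only shrinks with the square root of the sample size.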