by welder 684 days ago
And knowing beforehand when you won't get enough exposures to reach significance.

Not many people have enough traffic to A/B test small effects and reach significance without running the test for multiple years.
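To put rough numbers on that, here is a back-of-the-envelope sketch using the standard normal-approximation sample size formula for comparing two conversion rates. The function names, the 80% power default, and the traffic figures are my own assumptions for illustration, not anything from a specific tool.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-user Bernoulli variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(base_rate, relative_lift, daily_visitors):
    """Days needed when all daily visitors are split evenly across two arms."""
    return ceil(2 * users_per_arm(base_rate, relative_lift) / daily_visitors)


if __name__ == "__main__":
    # Hypothetical site: 3% baseline conversion, 1,000 visitors per day.
    for lift in (0.02, 0.10, 0.50):
        print(f"{lift:.0%} relative lift: {days_to_run(0.03, lift, 1000)} days")
```

Because the required sample grows with 1 / (p2 - p1)^2, halving the effect you want to detect roughly quadruples the wait, which is why small effects on low-traffic sites can take years.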

I don't use CUPED in my tests... how much can it reduce wait times?

Strictly speaking you don't need to wait for some arbitrary significance threshold. I don't know why so many people treat website A/B tests like carefully controlled, traditional NHST experiments. Website A/B testing is much better thought of as an optimization problem than as a true hypothesis test.

What's really important if you want to improve a website via A/B testing is a constant stream of new hypotheses (i.e. new variants). You can call tests "early" so long as you have new tests lined up; it boils down to a classic exploration/exploitation problem. In fact, in early development rapid iteration often yields better results than waiting for significance.

As a website matures and gets closer to some theoretical optimal conversion point, it becomes increasingly important to wait until you are very certain of an improvement. But if you're just starting A/B testing, more iteration will yield greater success than more certainty.

> You can call tests "early"

Another way to say that is: you can randomly pick a winner

Of course at the extreme you are overtuning for exploitation, but in practice it's never completely random. You always have some information about the probable winner, so long as P(A>B|obs) is not 0.5.
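For anyone wondering how you would actually get a number for P(A>B|obs), one common way is a Beta-Binomial model with Monte Carlo draws. This is only a sketch: the uniform Beta(1, 1) prior, the function name, and the example counts are my assumptions.

```python
import random


def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Estimate P(rate_A > rate_B | observed data) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same estimate
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_a > rate_b
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: A converted 70 of 1,000 users, B converted 50 of 1,000.
    print(prob_a_beats_b(70, 1000, 50, 1000))
```

With identical counts in both arms the estimate sits near 0.5, which is the "no information" case; anything away from 0.5 means picking the leader beats a coin flip, even if it is well short of a conventional threshold.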

Taking a long time to reach "significance" just means there is a small difference between the two variants, so it's better to just choose one and then try the next challenger, which might have a larger difference.

In the early stages of running A/B tests, being 90% certain that one variant is superior is perfectly fine so long as you have another challenger ready. Conversely, in the later stages of a mature website, when you're searching for minor gains, you probably want a much higher level of certainty than the standard 95%.

In either case thinking in terms of arbitrary significance thresholds doesn't make that much sense for A/B testing.

This may be true when B is missing the “Try/buy” button.

But for incremental, smaller changes, calling early is probably gambling.

You don't want to do that if you have seasonality, or novelty effects.
I don't think CUPED is super useful if you just stratify your users properly before the experiment begins.
CUPED is easier than stratifying users. Or, probably, you mean post-stratification. Still, CUPED is easier, in my personal opinion :)
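On the "how much can it reduce wait times" question upthread: CUPED subtracts the part of the metric that a pre-experiment covariate already explains, so the metric variance drops by roughly a factor of (1 - rho^2), where rho is the correlation between the covariate and the metric. Required sample size scales with variance, so the wait shrinks by about the same factor. A minimal sketch on synthetic data follows; the function names and the simulated numbers are made up for illustration, and in a real test theta would be estimated on the pooled data from both arms.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values: y - theta * (x - mean(x))."""
    mean_x = fmean(covariate)
    mean_y = fmean(metric)
    cov_xy = fmean([(x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)])
    theta = cov_xy / pvariance(covariate)  # regression slope of metric on covariate
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


def demo_variance_ratio(n=5000, seed=1):
    """Simulate users whose in-experiment metric correlates with pre-experiment behavior."""
    rng = random.Random(seed)
    pre = [rng.gauss(10, 3) for _ in range(n)]        # pre-experiment covariate
    post = [0.8 * x + rng.gauss(0, 2) for x in pre]   # in-experiment metric
    adjusted = cuped_adjust(post, pre)
    return pvariance(adjusted) / pvariance(post)      # fraction of variance remaining


if __name__ == "__main__":
    print(f"variance remaining after CUPED: {demo_variance_ratio():.2f}")
```

The adjustment leaves the mean of the metric unchanged, so the treatment effect estimate is the same in expectation; only the noise around it shrinks. How much you gain depends entirely on how well the pre-period covariate predicts the metric, so new users with no history get no benefit.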