HN Mirror



	by encoderer 3236 days ago
	You are just using noise then. It's not a matter of opinion, it's statistics.

2 comments

closed 3236 days ago

If you are waiting for N observations so that an NHST (null-hypothesis significance test) will have some level of power, and you assume each observation is drawn from the same distribution (as your test likely does), then you do not see each observation as noise.

You will just be acting under reduced certainty, but if you have to act, any information is better than no information.

(I'd be very interested to hear your statistical explanation).


encoderer 3236 days ago

The trouble is disproving the null hypothesis. In your test, if one variant beats another, you take that as a weak signal that one may be better than the other. The data doesn't support this. Without applying a standard to your p-value, you cannot disprove the null hypothesis: that your variant is likely no better or worse.

I'm not a statistician, but I've run a lot of A/B tests.


Silhouette 3236 days ago

You're ignoring closed's point that "a priori favors neither group A or B".

If you are starting from a neutral position, considering two possible alternatives with neither presumed to be more favourable than the other, then any statistical test based on using one outcome as null and the other as alternative hypothesis is fundamentally inappropriate. Any such test inherently favours one outcome over the other, rather than starting from a neutral position.

As closed is trying to explain, if you really do start from neutral then even a tiny number of data points is still better than no data at all. You shouldn't have too much confidence in whether you're really making the right decision, but if you have to make a decision, you are still more likely to make the right one if you go with what the data tells you, even if it's only telling you by a very small margin.
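To put a number on the "go with what the data tells you" rule, here is a minimal sketch that computes exactly how often that rule selects the truly better variant. The true conversion rates (0.3% vs 0.2%), the 500 visitors per variant, and the function names are illustrative assumptions, not figures from this thread.

```python
from math import comb

def binom_pmf(n, p):
    """Return [P(X = k) for k in 0..n] for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def prob_pick_better(n, p_better, p_worse):
    """Chance that 'adopt whichever variant converted more, coin-flip on ties'
    selects the truly better variant, with n visitors per variant."""
    better = binom_pmf(n, p_better)
    worse = binom_pmf(n, p_worse)
    win = 0.0        # P(better variant shows strictly more conversions)
    tie = 0.0        # P(both variants show the same count)
    cdf_worse = 0.0  # P(worse variant shows fewer than k conversions)
    for k in range(n + 1):
        win += better[k] * cdf_worse
        tie += better[k] * worse[k]
        cdf_worse += worse[k]
    return win + 0.5 * tie

# Hypothetical true rates: 0.3% vs 0.2%, 500 visitors each.
chance = prob_pick_better(500, 0.003, 0.002)
print(f"P(rule picks the better variant) = {chance:.3f}")
```

When the two true rates are equal the function returns 0.5, the same as a coin flip. Any real gap pushes it above 0.5, though with samples this small it stays well short of certainty.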


encoderer 3236 days ago

OK, so walk me through this in practice.

The way I see it, you need to prove that A is better than B by a sufficient margin to be distinguishable from pure noise.

So, imagine you put up a landing page with 2 variants. Each one gets 500 visitors. You have a conversion on one, but not the other. Is it your suggestion here that there is some significance to that single conversion?

I think the problem is, you have no idea if that user would've converted had she landed on the opposite variant. That is, you can't disprove the idea that your test makes no impact at all.
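The 500-vs-500 scenario above can be checked against a standard significance test. This is a sketch using a two-sided Fisher exact test written with the standard library only. The choice of test and the function name `fisher_two_sided` are assumptions made for illustration, since the thread does not name a specific test.

```python
from fractions import Fraction
from math import comb

def fisher_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 conversions table."""
    total = n_a + n_b
    conversions = conv_a + conv_b

    def table_prob(k):
        # Hypergeometric probability that k of the conversions land in A
        return Fraction(
            comb(conversions, k) * comb(total - conversions, n_a - k),
            comb(total, n_a),
        )

    observed = table_prob(conv_a)
    lo = max(0, conversions - n_b)
    hi = min(conversions, n_a)
    # Sum every table that is no more likely than the observed one
    return float(sum(table_prob(k) for k in range(lo, hi + 1)
                     if table_prob(k) <= observed))

# One conversion out of 500 on A, none out of 500 on B.
p_value = fisher_two_sided(1, 500, 0, 500)
print(p_value)
```

With a single conversion there are only two possible tables (it lands on A or on B), each with probability 1/2 under the null, so the p-value is 1.0. By the significance-test standard the result is indistinguishable from random assignment.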


Silhouette 3235 days ago

You're still thinking in terms of one version being the default and the other an alternative that must be positively proven to be better. If you are in a situation where you have cases A and B and no particular reason to believe a priori that either is more likely to be better than the other, that's a fundamentally different situation.

And in that situation, yes, if you run both versions with randomised visitors and you observe a small but non-zero sample where one converted and the other did not, that is evidence that one version may be better than the other. It's not particularly strong evidence, but it is a non-zero amount of evidence in one direction over the other, and that's better than the nothing at all that you had to separate the cases to start with.

Therefore, if you must make a choice about whether to adopt one version or the other at that stage, then in the absence of any better evidence, it is more likely that the version that has converted performs better than the version that has not and logically you should adopt the one that converted.
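One way to quantify "more likely" here is a Bayesian estimate. The sketch below assumes independent uniform Beta(1, 1) priors on each variant's conversion rate, which is one reading of "no reason to prefer either a priori", and estimates by Monte Carlo the posterior probability that the converting variant has the higher true rate. The prior, the draw count, and the function name are assumptions for illustration.

```python
import random

def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_a > rate_b | data) under
    independent Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_a > rate_b
    return wins / draws

# One conversion out of 500 on A, none out of 500 on B.
posterior = prob_a_beats_b(1, 500, 0, 500)
print(f"P(A is better | data) is roughly {posterior:.2f}")
```

Under these assumptions the estimate should come out near 0.75: better than the 0.5 you start with, but far from the certainty a conventional significance threshold demands.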

Of course in reality you would probably prefer to collect stronger evidence before making a decision if that is possible. But if it's not then, as closed wrote before, any information is better than no information at all.


encoderer 3235 days ago

Have you ever watched a test against a lot of traffic? With 50k visitors in test and 50k in control each day, you can see wild swings from one day to the next until you reach statistical significance.

I think you and the other guy want that single conversion to be evidence, but in reality, it's statistical noise.

A coin flip assigned that user to that variant. If they were going to convert anyway, you will be deriving meaning from pure coin flip chance, and you have no way of knowing with a single conversion whether this is true.

Again, it's not about going in with an assumption of which is better, it's about realizing that in split testing the biggest challenge is disproving the null hypothesis.
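The day-to-day swings described above can be reproduced with a simple A/A simulation, where both arms share the same true conversion rate and any gap between them is pure sampling noise. The 2% rate, the 7-day window, and the seed are assumptions picked for illustration. Only the 50k-per-arm figure comes from the comment.

```python
import random

def daily_rate_gaps(days=7, visitors=50_000, rate=0.02, seed=1):
    """Simulate an A/A test: both arms convert at the same true rate.
    Returns the observed (test - control) conversion-rate gap per day."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(days):
        conv_test = sum(rng.random() < rate for _ in range(visitors))
        conv_control = sum(rng.random() < rate for _ in range(visitors))
        gaps.append((conv_test - conv_control) / visitors)
    return gaps

gaps = daily_rate_gaps()
for day, gap in enumerate(gaps, start=1):
    print(f"day {day}: test - control = {gap:+.4%}")
```

Each daily gap here is noise by construction, so any single day's leader says nothing about a real difference. Pooling the days shrinks the noise, which is the usual argument for waiting on a larger sample.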


closed 3236 days ago

If one variant beats another, even with very few observations, the data DOES support that one is better. It's just that you might not be very confident that one is better.

The key to understanding this situation statistically is to reframe the way you think about tests: away from an all-or-nothing NHST, and toward either confidence intervals or Bayesian estimation.

That is, some kind of measure of (loosely) uncertainty around a parameter (or entire model) of interest.
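As a sketch of the confidence-interval framing, the snippet below computes 95% Wilson score intervals for the two arms of the 500-visitor example discussed elsewhere in the thread (1 conversion vs 0). The Wilson interval is one common choice for small counts. Its use here, and the function name, are assumptions rather than anything the commenters specified.

```python
from math import sqrt

def wilson_interval(conversions, visitors, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = conversions / visitors
    denom = 1 + z**2 / visitors
    center = (p_hat + z**2 / (2 * visitors)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / visitors
                    + z**2 / (4 * visitors**2)) / denom
    # Clamp to [0, 1] to guard against floating-point spill
    return max(0.0, center - half), min(1.0, center + half)

low_a, high_a = wilson_interval(1, 500)  # variant with one conversion
low_b, high_b = wilson_interval(0, 500)  # variant with none
print(f"A: [{low_a:.4%}, {high_a:.4%}]")
print(f"B: [{low_b:.4%}, {high_b:.4%}]")
```

The two intervals overlap heavily, which expresses the low confidence directly: the point estimates differ, but the plausible ranges for the true rates are nearly the same.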


sobani 3234 days ago

The question then is:

Is the available data more useful than a coin flip, which would be the alternative method of making a decision?

On the other hand, a coin-flip is probably the better tool. If you can't generate enough data for a statistical sample, then you're probably wasting your time creating an alternative version and setting up an A/B test.
