| HN Mirror



	by encoderer 3236 days ago
The trouble is disproving the null hypothesis. In your test, if one variant beats another, you take that as a weak signal that one may be better than the other. The data doesn't support this. Without applying a standard to your p-value, you cannot disprove the null hypothesis: that your variant is likely no better or worse. I'm not a statistician, but I've run a lot of A/B tests.

2 comments

Silhouette 3236 days ago

You're ignoring closed's point that "a priori favors neither group A or B".

If you are starting from a neutral position, considering two possible alternatives with neither presumed to be more favourable than the other, then any statistical test based on using one outcome as null and the other as alternative hypothesis is fundamentally inappropriate. Any such test inherently favours one outcome over the other, rather than starting from a neutral position.

As closed is trying to explain, if you really do start from neutral then even a tiny number of data points is still better than no data at all. You shouldn't have too much confidence in whether you're really making the right decision, but if you have to make a decision, you are still more likely to make the right one if you go with what the data tells you, even if it's only telling you by a very small margin.


encoderer 3236 days ago

OK, so walk me through this in practice.

The way I see it, you need to prove that A is better than B by a sufficient margin to be distinguishable from pure noise.

So, imagine you put up a landing page with 2 variants. Each one gets 500 visitors. You have a conversion on one, but not the other. It's your suggestion here that there is some significance to that single conversion?

I think the problem is, you have no idea if that user would've converted had she landed on the opposite variant. That is, you can't disprove the idea that your test makes no impact at all.
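
For anyone who wants to put a number on the 500-vs-500 scenario above, here is a minimal sketch. It assumes Fisher's exact test is a fair stand-in for "distinguishable from pure noise"; the thread doesn't name a test, and the function name is made up for illustration.

```python
from math import comb

def fisher_exact_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 table of conversions (illustrative helper)."""
    total_conv = conv_a + conv_b
    denom = comb(n_a + n_b, total_conv)

    def prob(k):
        # Hypergeometric probability that k of the total conversions land in variant A
        return comb(n_a, k) * comb(n_b, total_conv - k) / denom

    lo = max(0, total_conv - n_b)
    hi = min(n_a, total_conv)
    observed = prob(conv_a)
    # Sum every table that is at least as unlikely as the observed one
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

# One conversion on A, none on B, 500 visitors each
print(fisher_exact_two_sided(1, 500, 0, 500))
```

With a single conversion in total, the only two possible tables are "it landed on A" and "it landed on B", each equally likely under the null, so the two-sided p-value comes out at 1.0. That matches the intuition that a classical test can't reject "no difference" here.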


Silhouette 3235 days ago

You're still thinking in terms of one version being the default and the other an alternative that must be positively proven to be better. If you are in a situation where you have cases A and B and no particular reason to believe a priori that either is more likely to be better than the other, that's a fundamentally different situation.

And in that situation, yes, if you run both versions with randomised visitors and you observe a small but non-zero sample where one converted and the other did not, that is evidence that one version may be better than the other. It's not particularly strong evidence, but it is a non-zero amount of evidence in one direction over the other, and that's better than the nothing at all that you had to separate the cases to start with.

Therefore, if you must make a choice about whether to adopt one version or the other at that stage, then in the absence of any better evidence, it is more likely that the version that has converted performs better than the version that has not and logically you should adopt the one that converted.

Of course in reality you would probably prefer to collect stronger evidence before making a decision if that is possible. But if it's not then, as closed wrote before, any information is better than no information at all.
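
A rough sketch of the "weak but non-zero evidence" argument, under assumptions the thread doesn't spell out: independent uniform Beta(1, 1) priors on each variant's conversion rate as the "neutral position", and a Monte Carlo estimate rather than an exact calculation. The function name and defaults are hypothetical.

```python
import random

def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_A > rate_B) under uniform Beta(1, 1) priors (assumed)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_a > rate_b
    return wins / draws

# One conversion on A, none on B, 500 visitors each
print(prob_a_beats_b(1, 500, 0, 500))  # roughly 0.75
```

So under those assumptions the single conversion moves you from 50/50 to about 3-to-1 in favour of A: better than a coin toss if you are forced to choose, and nowhere near certainty.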


encoderer 3235 days ago

Have you ever watched a test against a lot of traffic? Even with 50k visitors in test and 50k in control each day, you can see wild swings from one day to the next until you reach statistical significance.

I think you and the other guy want that single conversion to be evidence, but in reality, it's statistical noise.

A coin flip assigned that user to that variant. If they were going to convert anyway, you will be deriving meaning from pure coin flip chance, and you have no way of knowing with a single conversion whether this is true.

Again, it's not about going in with an assumption of which is better, it's about realizing that in split testing the biggest challenge is disproving the null hypothesis.
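
The day-to-day swings can be reproduced with a toy A/A simulation, where both arms share the same true conversion rate and any daily "winner" is chance alone. This is only a sketch: the 2% rate, the 10-day window, and the function name are assumptions, not figures from the thread.

```python
import random

def simulate_aa_test(days=10, visitors_per_arm=50_000, true_rate=0.02, seed=1):
    """Daily relative difference (in %) between two arms with an identical true rate."""
    rng = random.Random(seed)
    daily_lift = []
    for _ in range(days):
        # Each visitor converts independently with the same probability in both arms
        conv_a = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        # Relative difference of B over A; report 0.0 if A had no conversions at all
        daily_lift.append(100.0 * (conv_b - conv_a) / conv_a if conv_a else 0.0)
    return daily_lift

for day, lift in enumerate(simulate_aa_test(), start=1):
    print(f"day {day}: B vs A = {lift:+.1f}%")
```

With these settings each arm gets about 1,000 conversions a day, so differences of a few percent in either direction are expected even though nothing differs between the arms.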


Silhouette 3235 days ago

>> I think you and the other guy want that single conversion to be evidence, but in reality, it's statistical noise.

It is evidence, just like any other properly collected data point. It's just very weak evidence, is what we're saying.

Of course in real world situations there may be a lot of variance and the correct answer may well turn out to be the other one. But in the absence of additional information, that is true for literally any number of samples that is less than whatever proportion of the population would give you absolute proof that your chosen answer is correct. If you have 50%-1 samples and every single one went with option A, you're still wrong if the other 50%+1 would have gone for option B.

What you're calling "noise" is an ill-defined concept. Qualitatively there is no difference for a result in a two-way test between a single sample and 50%-1. You still don't know for sure which answer is the right one. However, you're going to be much more confident about having the right answer in the latter case, which is what I think closed was trying to explain to you.

>> Again, it's not about going in with an assumption of which is better, it's about realizing that in split testing the biggest challenge is disproving the null hypothesis.

But if you're running a test with null and alternative hypotheses, you are going in with an a priori preference for one outcome over the other. You are literally saying that if the result is close enough, you will prefer not to reject the null hypothesis, and therefore whichever variation you have arbitrarily chosen to be your null hypothesis will be the answer.

That is self-evidently not a neutral assessment of option A vs. option B, and therefore there will be some cases where your test is more likely than not to make the wrong decision. In short, you are using an inappropriate test for the situation that closed was describing.


encoderer 3235 days ago

Alright, last comment from my side, just to clarify:

>> You are literally saying that if the result is close enough, you will prefer not to reject the null hypothesis, and therefore whichever variation you have arbitrarily chosen to be your null hypothesis will be the answer.

This is a misunderstanding. The null hypothesis is that your two variants have no statistical impact on conversion and any edge you see is just random. That is the hurdle you have to overcome to gain any useful direction from A/B testing.

GL!


closed 3236 days ago

If one variant beats another, even with very few observations, the data DOES support that one is better. It's just that you might not be very confident that one is better.

The key to understanding this situation statistically is to reframe the way you think about tests: away from an all-or-nothing NHST and toward either confidence intervals or Bayesian estimation.

That is, some kind of measure of (loosely) uncertainty around a parameter (or entire model) of interest.
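
As one illustration of the confidence-interval framing, here is a minimal sketch using the Wilson score interval on the 1-in-500 vs 0-in-500 example discussed elsewhere in the thread. The choice of Wilson (rather than some other interval) and the function name are assumptions made for the example.

```python
from math import sqrt

def wilson_interval(conversions, visitors, z=1.96):
    """Wilson score interval for a conversion rate (z=1.96 is roughly 95% coverage)."""
    p_hat = conversions / visitors
    denom = 1 + z**2 / visitors
    center = (p_hat + z**2 / (2 * visitors)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / visitors
                          + z**2 / (4 * visitors**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges
    return max(0.0, center - half_width), min(1.0, center + half_width)

for name, conv in (("A", 1), ("B", 0)):
    low, high = wilson_interval(conv, 500)
    print(f"{name}: {conv}/500 -> [{low:.4f}, {high:.4f}]")
```

The two intervals overlap almost entirely, which is the "not very confident" part; the point estimate for A is still the higher one, which is the "data does support it" part.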
