
by AlexeyMK 1007 days ago

> Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments

I get that a bunch at some of my clients. It's a common misconception. Let's say experiment B is 10% better than control but we're also running experiment C at the same time. Since C's participants are evenly distributed across B's branches, by default they should have no impact on the other experiment.

If you do a pre/post comparison, you'll notice that for whatever reason, both branches of C are doing 5% better than prior time periods, and this is because half of them are in the winner branch of B.
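A minimal sketch of that argument, assuming independent hash-based bucketing per experiment. The experiment names, the 10% baseline conversion rate, and the `bucket` helper are hypothetical, and it works with expected rates rather than sampled conversions:

```python
import hashlib

def bucket(user_id: int, experiment: str) -> int:
    """Assign a user to branch 0 or 1 by hashing the experiment name with the user id."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 2

BASE_RATE = 0.10  # hypothetical control conversion rate
B_LIFT = 0.10     # B's variant is 10% better (relative)

def expected_rate(user_id: int) -> float:
    """Expected conversion rate for a user; only B's variant changes it."""
    return BASE_RATE * (1 + B_LIFT) if bucket(user_id, "B") else BASE_RATE

# Group users by their branch in experiment C (which has no effect of its own).
c_rates = {0: [], 1: []}
for user_id in range(100_000):
    c_rates[bucket(user_id, "C")].append(expected_rate(user_id))

c_means = {branch: sum(r) / len(r) for branch, r in c_rates.items()}

# Pre/post view: each branch of C compared with the old baseline.
lift_vs_baseline = {branch: m / BASE_RATE - 1 for branch, m in c_means.items()}
# Experiment view: C's variant compared with C's control.
c_gap = c_means[1] / c_means[0] - 1

print(lift_vs_baseline)  # both branches sit close to +5%
print(c_gap)             # close to 0: B's win does not bias C's comparison
```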

NOW - imagine that the C variant is only an improvement _if_ you also include the B variant. That's where you need to be careful about monitoring experiment interactions, as I called out in the guide. But better to spend half a day writing an "experiment interaction" query than two weeks waiting for the experiments to run in sequence.
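One way such an interaction check could look, as a sketch: compare C's effect among users in B's variant with C's effect among users in B's control. The cell rates below are made up for illustration, and a real query would also need a standard error on the difference before acting on it.

```python
def interaction_effect(rates: dict) -> float:
    """rates[(b, c)] is the conversion rate for users in branch b of B and branch c of C.

    Returns how much larger C's effect is with B's variant than without it.
    A value near zero means the two experiments do not interact.
    """
    c_effect_without_b = rates[(0, 1)] - rates[(0, 0)]
    c_effect_with_b = rates[(1, 1)] - rates[(1, 0)]
    return c_effect_with_b - c_effect_without_b

# Hypothetical cell rates: C does nothing alone, but helps on top of B.
rates = {
    (0, 0): 0.100,  # neither variant
    (0, 1): 0.100,  # C only
    (1, 0): 0.110,  # B only
    (1, 1): 0.125,  # both
}

interaction = interaction_effect(rates)
print(f"{interaction:.3f}")  # 0.015
```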

> Second, xkcd 882 (https://xkcd.com/882/)

I think you're referencing p-hacking, right?

That is a valid concern to be vigilant for. In this case, XKCD is calling out the "find a subgroup that happens to be positive" hack (also here, https://xkcd.com/1478/). However, here we're (a) testing 3 different ideas and (b) only testing each of them once on the entire population. No p-hacking here (as far as I can tell, happy to learn otherwise), but good that you're keeping an eye out for it.

1 comments

yorwba 1006 days ago

The more experiments you run in parallel, the more likely it becomes that at least one experiment's branches do not have an even distribution across all branches of all (combinations of) other experiments.

And the more experiments you run, whether in parallel or sequentially, the more likely you are to get at least one false positive, i.e. p-hacking. XKCD is using "find a subgroup that happens to be positive" to make it funnier, but it's simply "find an experiment that happens to be positive". To correct for p-hacking, you would have to lower your threshold for each experiment, requiring a larger sample size, negating the benefits you thought you were getting by running more experiments with the same samples.
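To put numbers on that, here is a small sketch of the chance of at least one false positive across several experiments. It assumes the tests are independent and that none of the variants has a real effect; the function name is illustrative.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests,
    each run at significance level alpha, when no true effect exists."""
    return 1.0 - (1.0 - alpha) ** num_tests

for m in (1, 3, 5, 20):
    print(m, round(family_wise_error_rate(0.05, m), 3))
# 1 0.05
# 3 0.143
# 5 0.226
# 20 0.642
```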


lukego 1006 days ago

... and one such correction is the (simple, conservative, underused) Bonferroni Correction.
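For reference, a minimal sketch of the Bonferroni correction: with m comparisons, each p-value is tested against alpha / m instead of alpha. The p-values below are invented for illustration.

```python
def bonferroni_reject(p_values: list, alpha: float = 0.05) -> list:
    """Return, for each p-value, whether it is significant after Bonferroni correction."""
    threshold = alpha / len(p_values)  # per-test threshold shrinks with the number of tests
    return [p <= threshold for p in p_values]

# Three parallel experiments: uncorrected, all three would pass at alpha = 0.05.
p_values = [0.01, 0.04, 0.03]
decisions = bonferroni_reject(p_values)
print(decisions)  # [True, False, False], since the threshold is 0.05 / 3 ≈ 0.0167
```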


AlexeyMK 1006 days ago

Super helpful - looked it up, will aim to apply next time!

Curious how the Bonferroni correction applies in cases where the overlap is partial, i.e., experiment A ran from day 1 to 14, and experiment B ran (on the same group) from day 8 to 21. Do you just apply the correction as if there was full overlap?


lukego 1006 days ago

I believe you would apply the correction for every comparison you make regardless of the conditions. It's a conservative default to avoid accidentally p-hacking.

There might be other more specific corrections that give you power in a specific case. I don't know about that, I went Bayesian somewhere around this point myself.


RandomLensman 1006 days ago

There are a bunch of procedures under the label Family-wise Error Correction; some have issues in situations with non-independence (Bonferroni can handle any dependency structure, I think).

If there are a lot of tests/comparisons, you could also look at controlling the False Discovery Rate (usually increases power at the expense of more type I errors).
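As an illustration of False Discovery Rate control, here is a sketch of the Benjamini-Hochberg step-up procedure (one common FDR method, chosen here as an example). The p-values are made up, and the classic guarantee assumes independent or positively dependent tests.

```python
def benjamini_hochberg(p_values: list, q: float = 0.05) -> list:
    """Return, for each p-value, whether it is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bh = benjamini_hochberg(p_values)
print(sum(bh))  # 2 rejections; Bonferroni (threshold 0.05 / 8) would reject only the first
```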


AlexeyMK 1006 days ago

Thanks, that is a well reasoned argument!

My take is for small n (say 5 experiments at once) with lots of subjects (>10k participants per branch) and a decent hashing algorithm, the risk of uneven bucketing remains negligible. Is my intuition off?

False positives in experiments are definitely something to keep an eye on. The question to ask is what our comfort level is for trading off between false positives and velocity. This feels similar to the IRB debate to me, where being too restrictive hurts progress more than it prevents harm.


bertil 1006 days ago

No, the risk of uneven bucketing of more than 1% is minimal, and even when it’s the case, the contamination is much smaller than other factors. It’s also trivial to monitor at small scales.
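A sketch of what that monitoring could look like, assuming hash-based assignment with a per-experiment salt. The experiment names, the `bucket` helper, and the 20,000-user population (about 10k per branch) are hypothetical. For every ordered pair of experiments, it measures how far the split of one experiment deviates from 50% inside each branch of the other.

```python
import hashlib
from itertools import permutations

def bucket(user_id: int, experiment: str) -> int:
    """Assign a user to branch 0 or 1 from a hash salted with the experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return digest[0] % 2

def max_cross_imbalance(experiments: list, num_users: int) -> float:
    """Largest deviation from a 50/50 split of one experiment within a branch of another."""
    assignments = {e: [bucket(u, e) for u in range(num_users)] for e in experiments}
    worst = 0.0
    for x, y in permutations(experiments, 2):
        for branch in (0, 1):
            # Branch of y for each user that sits in the given branch of x.
            in_branch = [assignments[y][u] for u in range(num_users)
                         if assignments[x][u] == branch]
            share = sum(in_branch) / len(in_branch)
            worst = max(worst, abs(share - 0.5))
    return worst

EXPERIMENTS = ["exp1", "exp2", "exp3", "exp4", "exp5"]
worst = max_cross_imbalance(EXPERIMENTS, num_users=20_000)
print(f"largest deviation from 50%: {worst:.2%}")
```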

False positives do happen (Twyman's law is the most common way to describe the problem: underpowered experiment with spectacular results). The best solution is to ask if the results make sense using product intuition and continue running the experiment if not.

They are more likely to happen with very skewed observations (like how much people spend on a luxury brand), so if you have a goal metric that is skewed at the unit level, maybe think about statistical correction, or bootstrapping confidence intervals.
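A minimal sketch of a percentile bootstrap confidence interval for the difference in means, using only the standard library. The lognormal "spend" data is simulated purely to stand in for a skewed metric, and `bootstrap_diff_ci` is an illustrative helper, not a reference implementation.

```python
import random

def mean(xs: list) -> float:
    return sum(xs) / len(xs)

def bootstrap_diff_ci(control: list, variant: list, n_boot: int = 1000,
                      level: float = 0.95, seed: int = 0) -> tuple:
    """Percentile bootstrap CI for mean(variant) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each branch with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(mean(v) - mean(c))
    diffs.sort()
    cut = int((1 - level) / 2 * n_boot)  # number of resamples trimmed from each tail
    return diffs[cut], diffs[n_boot - 1 - cut]

# Simulated heavy-tailed spend per user (assumption: lognormal, same distribution in both branches).
data_rng = random.Random(42)
control = [data_rng.lognormvariate(3.0, 1.5) for _ in range(300)]
variant = [data_rng.lognormvariate(3.0, 1.5) for _ in range(300)]

lo, hi = bootstrap_diff_ci(control, variant)
print(f"95% CI for the difference in mean spend: [{lo:.1f}, {hi:.1f}]")
```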


bertil 1006 days ago

You are confusing:

a. the Family-Wise Error Rate (FWER, which is what xkcd 882 is about) and the many solutions for Multiple Comparison Correction (MCC: Bonferroni, Holm-Šidák, Benjamini-Hochberg, etc.) with

b. Contamination or Interaction: your two variants are not equivalent because one has 52% of its members part of Control from another experiment, while the other variant has 48%.

FWER is a common concern among statisticians when testing, but one with simple solutions. Contamination is a frequent concern among stakeholders, but it is very rare to observe even with a small sample size, and it even more rarely has a meaningful impact on results. Let’s say you have a 4% overhang, and the other experiment has a remarkably large 2% impact on a key metric. The contamination is only 4% * 2% = 0.08%.
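The arithmetic can be sketched as follows, using the 52% / 48% split from point b and the 2% impact. The variable names are illustrative, and the metric is normalized so that the other experiment's control equals 1.

```python
# Share of each of our branches that sits in the OTHER experiment's treatment.
share_treated_branch_a = 0.48  # branch with 52% in the other experiment's control
share_treated_branch_b = 0.52  # branch with 48% in the other experiment's control
other_lift = 0.02              # the other experiment's relative impact on the metric

# Expected metric per branch, relative to the other experiment's control.
metric_a = 1 + other_lift * share_treated_branch_a
metric_b = 1 + other_lift * share_treated_branch_b

overhang = share_treated_branch_b - share_treated_branch_a  # 4 percentage points
first_order_bias = overhang * other_lift                    # the 4% * 2% shortcut
exact_bias = metric_b / metric_a - 1                        # relative gap between branches

print(f"{first_order_bias:.4%}")  # 0.0800%
print(f"{exact_bias:.4%}")        # slightly below 0.08%
```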

It is a common concern and, therefore, needs to be discussed, but as Lukas Vermeer explained here [0], the solutions are simple and not frequently needed.

[0] https://www.lukasvermeer.nl/publications/2023/04/04/avoiding...
