|
|
|
|
|
by malf
1015 days ago
|
|
“Using modern experiment frameworks, all 3 of ideas can be safely tested at once, using parallel A/B tests (see chart).” Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments, so your significance calculation is now off. Second, xkcd 882. |
|
I get that a bunch at some of my clients. It's a common misconception. Let's say experiment B is 10% better than control but we're also running experiment C at the same time. Since C's participants are evenly distributed across B's branches, by default they should have no impact on the other experiment.
If you do a pre/post comparison, you'll notice that for whatever reason, both branches of C are doing 5% better than prior time periods, and this is because half of them are in the winner branch of B.
NOW - imagine that the C variant is only an improvement _if_ you also include the B variant. That's where you need to be careful about monitoring experiment interactions, I called out in the guide. But better so spend a half day writing an "experiment interaction" query than two weeks waiting for the experiments to run in sequence.
> Second, xkcd 882 (https://xkcd.com/882/) I think you're referencing P-hacking, right?
That is a valid concern to be vigilant for. In this case, XKCD is calling out the "find a subgroup that happens to be positive" hack (also here, https://xkcd.com/1478/). However, here we're testing (a) 3 different ideas and (b) only testing each of them once on the entire population. No p-hacking here (far as I can tell, happy to learn otherwise), but good that you're keeping an eye out for it.