Hacker News
by stdbrouw 2110 days ago
> bandits have higher statistical power, but also higher false-positive rate; false positives can be quite high cost since they cause thrash and require time to investigate if a feature that tested well does poorly in production

Not sure what you mean by this. Higher false-positive rate compared to what? And given that bandits, unlike your typical A/B test, do not run for a predefined amount of time but converge at a rate proportional to the evidence, a higher rate at which point in time?

Perhaps you mean that, because bandits typically run longer, there's a higher chance that they'll select an alternative that offers only a marginal improvement on the status quo whereas short experiments would just say "nah, no evidence that one is better than the other" and thereby get rid of a lot of noise?

1 comment

Thank you - your comment is right and I conflated two things which are conceptually totally different.

For a given number of experiments and block of time (i.e. available samples over time), it's not useful to say that bandits have higher power / a worse FPR, because both values are adjustable: you can trade one against the other by moving the decision threshold. F1 or AUC would probably be the right way to compare, and it seems likely to me that bandits have better precision-recall curves. So my original point doesn't hold here, and if anything the comparison favors bandits.
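To make the "adjustable" part concrete, here is a rough sketch for the fixed-length A/B test side only. Everything in it is an assumption of mine for illustration: the conversion rates, the sample size, the pooled two-proportion z-test, and the helper names. It simulates experiments with and without a real difference and shows that both the false-positive rate and the power move together as the significance threshold changes.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(conv_a, conv_b, n):
    """Two-sided p-value of a pooled two-proportion z-test, n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence either way
    z = ((conv_b - conv_a) / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_p_values(rate_a, rate_b, n, runs, rng):
    """Run `runs` simulated fixed-length experiments and collect their p-values."""
    p_values = []
    for _ in range(runs):
        conv_a = sum(rng.random() < rate_a for _ in range(n))
        conv_b = sum(rng.random() < rate_b for _ in range(n))
        p_values.append(two_proportion_p_value(conv_a, conv_b, n))
    return p_values

def rejection_rate(p_values, alpha):
    """Fraction of experiments declared significant at threshold alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

rng = random.Random(0)
# Hypothetical rates: no true difference (null) vs. a real lift (alternative).
null_p = simulate_p_values(0.10, 0.10, n=1000, runs=300, rng=rng)
alt_p = simulate_p_values(0.10, 0.13, n=1000, runs=300, rng=rng)

for alpha in (0.01, 0.05, 0.20):
    fpr = rejection_rate(null_p, alpha)    # estimated false-positive rate
    power = rejection_rate(alt_p, alpha)   # estimated power
    print(f"alpha={alpha:.2f}  FPR~{fpr:.3f}  power~{power:.3f}")
```

Loosening the threshold raises both numbers, which is why quoting a single power or FPR figure says little without the whole curve.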

I was really thinking about the scenario you mentioned, where the number of experiments is unconstrained and old experiments run long. Bandits will spend a lot of their bandwidth on very marginal improvements that are below the effect size cutoff that a shorter fixed-length RCT would set. I think you can fix this with early stopping (or just stopping), so maybe it's not really an issue after all.
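A minimal sketch of what I mean by stopping, assuming a Beta-Bernoulli Thompson sampling bandit. The function name, the 0.95 "probability of being best" cutoff, the check interval, and the pull budget are all made up for illustration, not taken from any particular library. The bandit stops either when one arm is very likely the best or when the budget runs out, which caps how long it can chase a marginal difference.

```python
import random

def thompson_with_stopping(true_rates, max_pulls, prob_best_cutoff=0.95,
                           check_every=100, draws=500, seed=0):
    """Thompson sampling over Bernoulli arms with two stopping rules:
    a posterior 'probability best' cutoff and a hard pull budget."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [1] * k    # Beta(1, 1) prior on each arm
    losses = [1] * k

    for pull in range(1, max_pulls + 1):
        # Sample a plausible rate per arm from its posterior, play the max.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1

        if pull % check_every == 0:
            # Monte Carlo estimate of P(arm is best) under the posterior.
            best_counts = [0] * k
            for _ in range(draws):
                s = [rng.betavariate(wins[i], losses[i]) for i in range(k)]
                best_counts[max(range(k), key=s.__getitem__)] += 1
            leader = max(range(k), key=best_counts.__getitem__)
            if best_counts[leader] / draws >= prob_best_cutoff:
                return {"stopped_at": pull, "winner": leader}

    # Budget exhausted without a confident winner: just stop.
    return {"stopped_at": max_pulls, "winner": None}

# Hypothetical scenarios: a clear improvement and a very marginal one.
clear = thompson_with_stopping([0.05, 0.20], max_pulls=5000)
marginal = thompson_with_stopping([0.100, 0.101], max_pulls=2000)
print(clear)
print(marginal)
```

With a clear difference the cutoff should trigger quickly. With a marginal one the budget is what bounds the run, much like the fixed length of an RCT would.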

Thanks for helping clarify my thinking on this :)