Hacker News
by tech_ken 1055 days ago
If the main objection to constructing a real-time product monitoring system for A/B(C/D/E...) decisions is that optional stopping is bad, why not throw away the null-hypothesis significance testing and instead treat the problem as a multi-armed bandit?
4 comments

I've built a multi-armed bandit system which lived alongside our A/B system.

1. Product didn't have any idea how to interpret its behavior and therefore never made any decisions based on it

2. Experimentation != product design. It's one thing to look at the results of a test, it's another thing to consider patterns of user behavior observed over months or years, which is what Product Analytics is actually for.

How does a typical NHST A/B system resolve this?
You tend to stop while the experiment is running, and then spend time looking at the results once it's done.

The real benefits here are getting a better understanding of what levers drive your product metrics, as you'll inevitably mess up the first n or so experiments (if I could give you only one piece of advice, it would be to use stratified randomisation, but everyone seems to have to make this mistake for themselves).
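
For readers who have not met stratified randomisation, here is a minimal sketch of the idea: split users into strata first, then randomise within each stratum so every arm gets a near-equal share of each one. The function name `stratified_assign`, the `stratum_of` callback, and the arm labels are all hypothetical, chosen for illustration; this is not the commenter's implementation.

```python
import random
from collections import defaultdict

def stratified_assign(users, stratum_of, arms=("A", "B"), seed=0):
    """Assign users to arms so that each stratum is split as evenly as possible.

    `stratum_of` is a hypothetical callback mapping a user to a stratum label
    (for example platform or country). Stratum labels must be sortable.
    """
    rng = random.Random(seed)

    # Group users by stratum.
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[stratum_of(user)].append(user)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)  # random order within the stratum
        # Random starting arm, so leftover users do not always go to the first arm.
        offset = rng.randrange(len(arms))
        for i, user in enumerate(members):
            assignment[user] = arms[(i + offset) % len(arms)]
    return assignment

if __name__ == "__main__":
    # Toy users: (id, platform). The platform is the stratum.
    users = [(i, "ios" if i % 3 == 0 else "android") for i in range(20)]
    for user, arm in sorted(stratified_assign(users, lambda u: u[1]).items()):
        print(user, arm)
```

Within any stratum the arm sizes differ by at most one user, which is the property plain coin-flip assignment does not guarantee.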

Advice appreciated but I'm exceedingly familiar with experimental design haha, what I understand far less well though is the integration of the toolset into a business/product development context. I can see how having a staggered cadence of stopping, reflecting on the experimental design, and making a decision is wise. But it still seems that you could perform the experiment using MAB to keep the profit motive happy (you don't want to waste potential click-throughs just because you like p-values, maybe tune it to be more conservative about shifting heavily to one arm) and then have some period where you stop the experiment to pause and reflect.

Heck you don't even have to do MAB if you don't want to, just don't use NHST. The Bayesian "flavor" of NHST (credible intervals around posterior expected values) has absolutely no problem with optional stopping. Run the experiment until you've got a precise enough estimate, then sit back and make your product decisions.
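
A rough sketch of that "run until the estimate is precise enough" rule, for anyone who wants to see it concretely. Everything here is an assumption made for illustration: a uniform Beta(1, 1) prior on each arm's conversion rate, a Monte Carlo approximation of the credible interval for the difference between two arms, simulated traffic with made-up rates, and the hypothetical names `diff_interval` and `run_until_precise`.

```python
import random

def diff_interval(a_succ, a_fail, b_succ, b_fail, level=0.95, draws=20000, rng=None):
    """Approximate credible interval for (rate_B - rate_A) by posterior sampling.

    Assumes independent Beta(1, 1) priors, so each posterior is
    Beta(1 + successes, 1 + failures).
    """
    rng = rng or random.Random(0)
    diffs = sorted(
        rng.betavariate(1 + b_succ, 1 + b_fail) - rng.betavariate(1 + a_succ, 1 + a_fail)
        for _ in range(draws)
    )
    lo = diffs[int((1 - level) / 2 * draws)]
    hi = diffs[int((1 + level) / 2 * draws) - 1]
    return lo, hi

def run_until_precise(rate_a, rate_b, max_width=0.05, batch=200, max_per_arm=50000, seed=1):
    """Simulate two arms and stop once the interval on the difference is narrow enough."""
    rng = random.Random(seed)
    a_succ = a_fail = b_succ = b_fail = 0
    n = 0
    lo, hi = -1.0, 1.0
    while n < max_per_arm:
        # Collect one batch of simulated users per arm.
        for _ in range(batch):
            if rng.random() < rate_a:
                a_succ += 1
            else:
                a_fail += 1
            if rng.random() < rate_b:
                b_succ += 1
            else:
                b_fail += 1
        n += batch
        lo, hi = diff_interval(a_succ, a_fail, b_succ, b_fail, rng=rng)
        if hi - lo <= max_width:  # stop on precision, not on "significance"
            break
    return n, (lo, hi)

if __name__ == "__main__":
    n, (lo, hi) = run_until_precise(rate_a=0.10, rate_b=0.12)
    print(f"stopped at {n} users per arm, interval for B - A: [{lo:.3f}, {hi:.3f}]")
```

Note that the stopping condition looks only at the interval width, not at whether the interval excludes zero.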

I guess where I'm going with all this is that it seems like the post's strongest point is "good product decisions require time, and realtime analytics bamboozle us into thinking fast decisions are better". All the stuff about NHST seems kind of tangential. Looking at it again I see that it's like a decade old, so I think this is the best explanation for why they were targeting NHST more aggressively. I would hope in our post-replication crisis world (hopefully "post", anyways) data scientists and A/B testers are more prudent about some of these better-known pitfalls.

MAB and its friends like contextual MAB have always been the dream. Closing the loop so analytics data is pushed back to the decision point in code and isn't a one-way pipe to some dashboard is the hardest part though. For non-technical reasons.
Sort of a generalized PEBCAK
Because it is difficult to map that onto real business decisions, and it often requires supporting a large space of possible UI combinations because they haven't been fully ruled out yet.
Doesn’t that problem also exist with NHST-based A/B testing?
I think business decisions map very well onto the binary decisions implied by NHST A/B testing, which is partly why we put so many resources into studying those problems in the early 20th century.
How well does that dodge the problem? I'd imagine a multi-armed bandit should stay such that it is always sampling from many fair coins, as it were. I would be delighted to read a study on that.
I can’t say that I did the proof out, but intuitively I would expect the posterior distribution over arm-probabilities would converge to something equal? The other option is spurious convergence to a bad posterior, which could maybe happen with poor sampling techniques, but I can’t imagine it’s more than an edge case
Right, that is what I meant about it should continue to sample from fair coins. I don't know that I've seen experiments to see how long that takes, though.
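
One cheap way to poke at this is a simulation. The sketch below assumes Thompson sampling with Beta(1, 1) priors as the bandit (the thread does not say which algorithm is meant) and gives every arm the same made-up conversion rate. The function name `thompson_equal_arms` and all parameter values are hypothetical.

```python
import random

def thompson_equal_arms(n_arms=3, rate=0.1, rounds=5000, seed=0):
    """Run Thompson sampling on arms that all share the same true rate.

    Returns the number of times each arm was chosen.
    """
    rng = random.Random(seed)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(rounds):
        # Draw one sample from each arm's Beta posterior and play the largest.
        draws = [rng.betavariate(1 + wins[k], 1 + losses[k]) for k in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        if rng.random() < rate:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[k] + losses[k] for k in range(n_arms)]

if __name__ == "__main__":
    # Compare traffic shares across a few seeds to see how stable the split is.
    for seed in range(5):
        pulls = thompson_equal_arms(seed=seed)
        total = sum(pulls)
        print(seed, [round(p / total, 2) for p in pulls])
```

Varying `rounds` and the seed gives a feel for how the traffic split behaves over time when there is genuinely no difference between arms.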

There is also the question of how long you'd leave multiple treatments out there. Presumably, even if there is no difference in outcomes, there can be benefits to having fewer deployed behaviors.

I'm now also curious if there are non-transitive situations. For example, three treatments together that all act fair if all deployed, but for reasons any two of them deployed alone will show a preference. Ideally, of course, treatments should be done such that this can't happen, but mistakes are often made.

Edit: Fully cede that this is likely chasing edges. The motivation for fewer deployed arms is far more compelling than the edge cases.