by yichijin 2894 days ago
Hey all, statistician from Optimizely chiming in here. Just wanted to point out that this is exactly the right point.

I wanted to add one detail--there actually are ways to do early stopping while staying within a frequentist approach. For example, most clinical trial methods are not Bayesian; they are fixed-horizon tests in which the allowable amount of Type 1 error is "spread out" among multiple looks that are planned in advance.
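To make the "spread out" idea concrete, here is a rough simulation sketch. It is not what any particular clinical trial uses: it assumes normal data with known variance, five equally spaced looks, and a plain Bonferroni split of alpha (real trials more often use Pocock or O'Brien-Fleming boundaries). The function name and all parameters are made up for illustration.

```python
import math
import random
from statistics import NormalDist

def false_positive_rate(alpha_per_look, looks=5, n_per_look=100, trials=4000, seed=0):
    """Share of A/A experiments (no true effect) that ever cross the threshold."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha_per_look / 2)  # two-sided cutoff
    batch_sd = math.sqrt(n_per_look)  # sd of a sum of n_per_look N(0, 1) draws
    rejections = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(looks):
            total += rng.gauss(0.0, batch_sd)  # new data since the last look
            n += n_per_look
            z = total / math.sqrt(n)  # z-statistic, known variance 1
            if abs(z) > z_crit:
                rejections += 1  # stop early and declare a winner
                break
    return rejections / trials

naive = false_positive_rate(alpha_per_look=0.05)       # full alpha at every look
spread = false_positive_rate(alpha_per_look=0.05 / 5)  # alpha split over 5 looks
print(f"naive peeking: {naive:.3f}, alpha spread out: {spread:.3f}")
```

Under these assumptions, reusing the full 0.05 at every look should push the overall false positive rate well above 0.05, while the split version should stay at or below it (Bonferroni is conservative, so it gives up some power in exchange).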

At Optimizely we essentially have a continuous version of this that does in fact allow for multiple looks with rigorous control of Type 1 error. As tedsanders mentions, the key upside is that if you start an experiment with a larger-than-expected lift, you can terminate it early. Then over many repeated experiments, you gain a lot in terms of average time to significance.
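For a flavor of what a "continuous" version can look like, below is a minimal sketch of an always-valid p-value built from a mixture sequential probability ratio test, in the spirit of the paper linked further down. This is not Optimizely's actual implementation: it assumes i.i.d. normal observations with known variance `sigma2`, a null mean `theta0`, and a normal mixing distribution with variance `tau2`, all of which are illustrative choices, as is the function name.

```python
import math
import random

def always_valid_p_values(xs, theta0=0.0, sigma2=1.0, tau2=1.0):
    """Running always-valid p-values from a mixture SPRT for a normal mean.

    Assumes known variance sigma2 and a N(theta0, tau2) mixing distribution
    over the alternative; tau2 is a tuning choice, not something estimated here.
    """
    p, total, out = 1.0, 0.0, []
    for n, x in enumerate(xs, start=1):
        total += x - theta0
        # log of the mixture likelihood ratio after n observations
        log_lr = (0.5 * math.log(sigma2 / (sigma2 + n * tau2))
                  + tau2 * total ** 2 / (2 * sigma2 * (sigma2 + n * tau2)))
        p = min(p, math.exp(-log_lr))  # the p-value never goes back up
        out.append(p)
    return out

# Hypothetical experiment with a large true lift: check after every observation.
rng = random.Random(1)
big_lift = [rng.gauss(0.5, 1.0) for _ in range(2000)]
p_values = always_valid_p_values(big_lift)
stop_at = next((i + 1 for i, p in enumerate(p_values) if p <= 0.05), None)
print("first look with p <= 0.05:", stop_at)
```

Because the p-value is valid at every sample size, you can stop the first time it drops below alpha. A big lift gets you there quickly, and a null effect only crosses with probability at most alpha no matter how often you look.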

The dissonance in this discussion mostly stems from the fact that this paper (which we actually collaborated on!) uses data from 2014, before we rolled out this new Stats Engine.

For more, I would encourage a look at our paper: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...

What's the tradeoff vs. just taking a direct Bayesian approach?

In fact, why use an inferential framework at all (estimating some sort of probability and using it to guide action), rather than directly using a policy learning framework, e.g. modeling this as Q-learning or multi-armed bandit problem?

If at the end of the day you have some objective function (e.g. 'making money'), some known space of actions (e.g. move this widget up the page, change the color, engage with user this way), and a reasonable way to associate those two, then isn't the company literally doing reinforcement learning over time?

It seems one benefit of a reinforcement learning framework is it maintains a set of actions that will still be explored in the future without forcing you to prematurely 'choose' whether A or B is actually better—if A is better in reality, then it will be explored more and more often and B will progressively become downweighted over time.
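As a toy illustration of that downweighting, here is a sketch using Thompson sampling on two variants with Bernoulli conversions. Thompson sampling is just one bandit policy picked for brevity (it is not Q-learning); the conversion rates, the uniform Beta(1, 1) priors, and the function name are all invented for the example.

```python
import random

def thompson_sampling(true_rates, rounds=5000, seed=0):
    """Return how many times each arm was shown under Thompson sampling."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins, losses, pulls = [0] * k, [0] * k, [0] * k
    for _ in range(rounds):
        # Draw a plausible conversion rate for each arm from its Beta posterior
        samples = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = max(range(k), key=samples.__getitem__)  # show the best-looking arm
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated conversion
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

# Hypothetical variants: A converts at 10%, B at 5%.
pulls = thompson_sampling([0.10, 0.05])
print("times shown (A, B):", pulls)
```

Neither arm is ever formally declared the loser. B simply gets shown less and less as its posterior falls behind, but it still gets an occasional look.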

> If at the end of the day you have some objective function

That "If" often evaluates to false.

There are tough judgement calls involved in selecting the metric that the org wants to optimize. It is very rare that business management commits to a clear quantitative goal. The reasons are many: weasel room is important politically, selecting a metric that captures both short-term and long-term goals is difficult, costs are uncertain because it is unclear how overhead should be billed, and so on.

This is fairly common. Typically, in these situations it's the PMs who make the final call. There, the goal of the experiment is to glean as much knowledge as possible and present it to the PM. If that comes at the cost of exposing some customers to bad choices, so be it -- in other words, explore at the cost of lost opportunities to exploit.

> why use an inferential framework at all

Probably because of the maintenance cost of the code that was only explored but never exploited.

A policy learning approach is better imho - but getting people to switch to a multi-armed bandit when they are used to A/B testing can be difficult.

People don't seem to trust the system to make the right decisions even though you can do simulations and have the mathematics to show it is correct.