by yichijin 2894 days ago
Hi, Jimmy from Optimizely here. The practice you describe is actually perfectly fine, so long as you're not using a method designed to be checked at a single point in time.

Take a look at clinical trials. Often in clinical trials there are multiple phases, where early stopping is desirable in case the drug has higher-than-expected efficacy (or more-harmful-than-expected side effects).

The types of tests conducted in clinical trials explicitly allow for multiple looks while maintaining correct control of the Type 1 error rate. At Optimizely we essentially have a version of this where the monitoring can be conducted continuously with rigorous control of Type 1 error.

Check out this paper for more details: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...


Caveat emptor: I am reading the first pages of the article you link to. On page 1519 they say 1-alpha is the desired significance level. This is wrong; perhaps they mean that alpha is the significance level and 1-alpha is the desired confidence level. In step 3 they say that preferred test statistics are the ones that can control Type I errors. But that is wrong: the Type I error rate is a parameter you fix, so it is not related to the test statistic. Later, when giving examples of uniformly most powerful statistics, they require data following a normal distribution, but in web data the distribution can be a mixture of normals whose means depend on the hour. So perhaps the examples are not realistic in the web setting. To be continued.
> Hi, Jimmy from Optimizely here. The practice you describe is actually perfectly fine, so long as ...

Lotsa things are OK so long as you are doing X and Y etc.

Take a look at a portion of the clinical trial guidance[1] from the FDA. Note specifically the basic Stats guidance:

    6.9.1 A description of the statistical methods to
      be employed, including timing of any planned
      interim analysis(ses).

    6.9.2 The number of subjects planned to be
      enrolled. In multicenter trials, the numbers of
      enrolled subjects projected for each trial site
      should be specified. Reason for choice of
      sample size, including reflections on (or
      calculations of) the power of the trial and
      clinical justification.
I don't think it's recommended practice anywhere to start collecting data, do a simple t-test after each observation, and declare a significant difference as soon as p < 0.05.
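To see why testing after every observation is a problem, here is a rough simulation sketch. It is not anyone's production method: it assumes a one-sample z-test with known variance as a stand-in for the t-test, and the function name, the 200-observation cap, and the 2000 simulated experiments are all made up for illustration. Every simulated experiment has no true effect, so any "significant" result is a false positive.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_max, peek, n_sims=2000, alpha=0.05, seed=0):
    """Fraction of no-effect experiments that end up declared significant.

    peek=True tests after every observation and stops at the first p < alpha.
    peek=False tests exactly once, after n_max observations.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_sims):
        total = 0.0
        rejected = False
        for n in range(1, n_max + 1):
            # Null hypothesis is true: mean 0, known standard deviation 1.
            total += rng.gauss(0.0, 1.0)
            if peek or n == n_max:
                z = total / n ** 0.5  # z statistic for the running mean
                if abs(z) > z_crit:
                    rejected = True
                    break
        rejections += rejected
    return rejections / n_sims


single_look = false_positive_rate(n_max=200, peek=False)
every_look = false_positive_rate(n_max=200, peek=True)
print(f"single look: {single_look:.3f}, peeking every observation: {every_look:.3f}")
```

Both calls use the same seed, so they see the same data streams. The single-look rate should land near the nominal 5%, while the peeking rate comes out several times higher.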

Of course, if every other patient is suffering serious consequences, or becoming miraculously well on the second day of the trial, you stop. In those cases, you generally don't need a statistical test to tell you that your a priori evaluation of the drug or intervention was wrong.

I fail to see what is so vital about some web site A/B test that one cannot be bothered to think ahead about what defines an observational unit, how many of those one might need to detect an improvement, and wait until after that sample has been attained to test (and, if the web site doesn't get enough visitors to fulfill your sample size requirement for that particular test, that is a different problem entirely).
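As a sketch of what "thinking ahead" about sample size can look like, the snippet below uses the standard normal-approximation formula for comparing two proportions. The function name and the 10% vs. 12% conversion rates are hypothetical, chosen only to illustrate the calculation, and the observational unit is assumed to be one independent visitor.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate visitors needed per arm for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls Type I error
    z_beta = NormalDist().inv_cdf(power)           # controls Type II error
    # Sum of the Bernoulli variances of the two arms.
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    effect = p_variant - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates: 10% baseline, hoping to detect a lift to 12%.
n_small_lift = sample_size_per_arm(0.10, 0.12)
# A larger lift (10% to 15%) needs far fewer visitors.
n_large_lift = sample_size_per_arm(0.10, 0.15)
print(n_small_lift, n_large_lift)
```

The required sample grows with the inverse square of the effect size, which is why small expected lifts on low-traffic sites can make a fixed-sample test impractical.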

[1]: https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegula...

Presumably using your method takes longer/requires more samples than a method that only checks once?
I haven't looked at the KDD paper, but in general it is the other way round. With sequential hypothesis testing you can expect to need less data on average.
That's highly counter-intuitive to me. Can you try to motivate why that's the case?

My intuition is that any sequential (which I translated to online) technique could also be used in a non-sequential context. By that reasoning, there's no way a sequential technique could do better; at best it could be the same.

This is 1940s stuff. Check out Wald.

Short answer: in sequential testing you can ask at intermediate stages whether a satisfactory confidence has been reached. If yes, you are done, and if not, you can continue. On average you will hit a 'yes' sooner. For non-sequential testing you cannot do this if you care about correctness (*). So the sample size needs to be pessimistic for non-sequential protocols, and then you are bound to that commitment.

(*) If your method ensures correctness even after inspection at intermediate stages, then it's a sequential method by definition. There is some confusion in the literature about Bayesian and sequential. They are orthogonal concepts. Both Bayesian and Frequentist hypothesis tests can be sequential.
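For a concrete feel, here is a minimal simulation sketch of Wald's sequential probability ratio test (SPRT) compared with a fixed-sample test. The setup is an assumption made for illustration: normal data with known standard deviation 1, H0 mean 0 against H1 mean `delta`, and Wald's approximate stopping thresholds. The function names and parameter values are hypothetical, and this is not the method from the KDD paper.

```python
import math
import random
from statistics import NormalDist


def sprt_run(rng, true_mean, delta, alpha, beta, max_n=10_000):
    """Run one SPRT; return (samples used, whether H0 was rejected)."""
    upper = math.log((1 - beta) / alpha)  # cross this: reject H0
    lower = math.log(beta / (1 - alpha))  # cross this: accept H0
    llr = 0.0
    for n in range(1, max_n + 1):
        x = rng.gauss(true_mean, 1.0)
        # Log-likelihood ratio increment for N(delta, 1) versus N(0, 1).
        llr += delta * x - delta ** 2 / 2
        if llr >= upper:
            return n, True
        if llr <= lower:
            return n, False
    return max_n, False  # safety cap, practically never reached here


def simulate(true_mean, delta=0.5, alpha=0.05, beta=0.05, n_sims=2000, seed=1):
    """Average sample size and rejection rate over many SPRT runs."""
    rng = random.Random(seed)
    runs = [sprt_run(rng, true_mean, delta, alpha, beta) for _ in range(n_sims)]
    avg_n = sum(n for n, _ in runs) / n_sims
    reject_rate = sum(rejected for _, rejected in runs) / n_sims
    return avg_n, reject_rate


delta, alpha, beta = 0.5, 0.05, 0.05
z = NormalDist().inv_cdf
# Fixed-sample size for a one-sided z-test with the same alpha and beta.
fixed_n = math.ceil(((z(1 - alpha) + z(1 - beta)) / delta) ** 2)

avg_n_h0, type1 = simulate(true_mean=0.0)    # H0 true: rejections are errors
avg_n_h1, power = simulate(true_mean=delta)  # H1 true: rejections are correct
print(f"fixed n = {fixed_n}")
print(f"SPRT under H0: avg n = {avg_n_h0:.1f}, Type I rate = {type1:.3f}")
print(f"SPRT under H1: avg n = {avg_n_h1:.1f}, power = {power:.3f}")
```

An individual sequential run can take longer than the fixed-sample size, but the average is lower because clear-cut runs stop early. That is the sense in which the sequential test "needs less data".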

Ah! I get it. Thank you!