by leo_pekelis 4163 days ago
Hello, Leo here, Optimizely's in-house statistician. The graph you reference is a schematic to show the differences between Optimizely’s previous statistical platform and Stats Engine. It shows a monotone non-decreasing significance because under our sequential testing framework, the significance value represents the total amount of accumulated evidence against the null hypothesis of no difference between a variation and baseline. This wealth of evidence cannot decrease because you can only get more information about your test as you get more visitors. Of course, it is very possible, as the graph shows, that you will not acquire enough contradictory evidence to reach significance in a reasonable number of visitors. If we had instead looked continuously at a classical t-test, the significance would have oscillated near the significance threshold. Spurious deviations would cause multiple, contradictory declarations that the test is significant and then not. A savvy A/B tester might wait until the oscillations die down. Sequential testing is a principled, mathematical way to differentiate evidence against the null hypothesis from random oscillations in real time. Note that the chance of a type I error is still controlled at 5%.
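
To make the peeking problem concrete, here is a rough simulation sketch. It is not Optimizely's code, and it makes simplifying assumptions: per-visitor differences are standard normal, the variance is known (so a z-test stands in for the t-test), and the function name and parameters are made up for illustration. It runs A/A tests with no true difference and compares "declare significance if the test ever crosses the threshold while peeking" against "look once at a predetermined visitor count".

```python
import math
import random

def peeking_vs_fixed(n_sims=400, n_visitors=500, first_look=10, z_crit=1.96, seed=0):
    """Simulate A/A tests (no true difference) and compare two decision rules."""
    rng = random.Random(seed)
    ever_significant = 0    # peeked after every visitor, counted if it ever crossed
    final_significant = 0   # looked once, at the predetermined visitor count
    for _ in range(n_sims):
        total = 0.0
        crossed = False
        z = 0.0
        for t in range(1, n_visitors + 1):
            total += rng.gauss(0.0, 1.0)   # per-visitor difference, unit variance assumed
            z = total / math.sqrt(t)       # z-statistic for "mean difference = 0"
            if t >= first_look and abs(z) > z_crit:
                crossed = True
        ever_significant += crossed
        final_significant += abs(z) > z_crit
    return ever_significant / n_sims, final_significant / n_sims

peek_rate, fixed_rate = peeking_vs_fixed()
print(f"false positive rate, peeking every visitor: {peek_rate:.2f}")
print(f"false positive rate, single fixed look:     {fixed_rate:.2f}")
```

The single look should land near the nominal 5%, while the peeking rate should come out several times higher, which is the inflation that a sequential procedure is designed to remove.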

You do make a good point that sometimes an A/B test will see regression over time. We have explicitly separated this out because we feel detecting a change in the underlying effect size is different from testing whether the effect is non-zero, and different statistical methods are better suited to one over the other. We have built a policy into our framework that monitors for such temporal effects and signals an A/B test is in a ‘reset’ when we discover them. In our historical database, this happened on about 4% of tests.

I concede all this is a lot to get across in one graph, but we do feel that it is a good representation of how significance behaves under Stats Engine. If you would like to read more about the math behind Stats Engine, here is a link to a full technical article: http://pages.optimizely.com/rs/optimizely/images/stats_engin...

2 comments

> It shows a monotone non-decreasing significance because [the value] represents the total amount of accumulated evidence against the null hypothesis.

> if we instead looked continuously at a classical t-test, is the significance would oscillate near the significance threshold

So there's your answer: the y-axis on the chart has an unlabeled different meaning for the blue line.

While I have you here Leo, can you explain why you would want to chart only the accumulated evidence for X? It's meaningless without knowing how much evidence has been accumulated for not X.

One point of clarification: the y-axis on the chart does have the same meaning for both lines. It is 1 minus the chance of committing a type I error. You do point out an important nuance, though: under sequential testing, a type I error becomes “ever detecting a significant result on an insignificant test” rather than detecting one at a single, predetermined visitor count.

The amount of accumulated evidence for X is exactly a p-value, or a measurement which can tell you if there is enough evidence in the experiment to contradict a hypothesis of “no difference between a baseline and variation.” A high p-value, or low significance, tells you there is a lack of evidence to make this claim.
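
As a rough illustration of a p-value that only accumulates evidence, here is a sketch of one standard construction: a mixture sequential probability ratio test. Everything specific in it is an assumption for the example rather than a description of what Stats Engine actually runs: normal data with known variance `sigma2`, a N(0, `tau2`) mixing distribution over the alternative mean, and the function name itself. The p-value is the running minimum of 1 over the mixture likelihood ratio, so significance (1 minus the p-value) can never decrease.

```python
import math
import random

def always_valid_p_values(xs, sigma2=1.0, tau2=1.0):
    """Running always-valid p-values from a mixture SPRT (normal data, known variance)."""
    p = 1.0
    total = 0.0
    out = []
    for n, x in enumerate(xs, start=1):
        total += x
        # Log of the mixture likelihood ratio against H0: mean = 0,
        # with a N(0, tau2) mixing distribution over the alternative mean.
        log_lr = (0.5 * math.log(sigma2 / (sigma2 + n * tau2))
                  + tau2 * total ** 2 / (2.0 * sigma2 * (sigma2 + n * tau2)))
        p = min(p, math.exp(-log_lr))   # running minimum: evidence never "un-accumulates"
        out.append(p)
    return out

# A/A simulation: how often does the test EVER reach significance at alpha = 0.05?
rng = random.Random(1)
n_sims, n_visitors, alpha = 400, 500, 0.05
false_positives = 0
for _ in range(n_sims):
    ps = always_valid_p_values(rng.gauss(0.0, 1.0) for _ in range(n_visitors))
    false_positives += ps[-1] <= alpha   # last value is the minimum over all looks
ever_rate = false_positives / n_sims
print(f"rate of ever declaring significance with no true effect: {ever_rate:.3f}")
```

Because the last p-value is the minimum over every look, checking it is the same as asking whether the test ever crossed the threshold. In theory that "ever" rate is bounded by alpha no matter how often you look, and in a finite simulation like this one it should come out at or below roughly 5%.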

You bring up a very interesting point: with sequential testing it is actually possible to also look for evidence of ‘not X’, that is, evidence that there really is no detectable difference. This works by ‘flipping the hypothesis test on its head’ and allows for a mathematical formulation of stopping early for futility. We do not currently offer this in Stats Engine because we believe it’s the less important quantity of the two, but it may be the focus of future research.

Sorry for the delay in responding. HN thinks we are responding too fast to comments over here. =)
You keep using that word "classical." If by "classical" you mean frequentist, then sequential testing is the appropriate frequentist method to have been using all along. If by "classical," you mean "old and established," then sequential testing is still the appropriate method to have been using all along.