by leo_pekelis
4163 days ago
Hello, Leo here, Optimizely's in-house statistician. The graph you reference is a schematic showing the differences between Optimizely’s previous statistical platform and Stats Engine. It shows a monotone non-decreasing significance because, under our sequential testing framework, the significance value represents the total amount of accumulated evidence against the null hypothesis of no difference between a variation and baseline. This accumulated evidence cannot decrease, because you can only get more information about your test as you get more visitors. Of course, it is very possible, as the graph shows, that you will not acquire enough evidence against the null hypothesis to reach significance in a reasonable number of visitors.

What would have happened if we instead looked continuously at a classical t-test, is the significance would oscillate near the significance threshold. Spurious deviations would cause multiple, contradictory declarations that the test is significant and then not. A savvy A/B tester might wait until the oscillations die down. Sequential testing is a principled, mathematical way to distinguish evidence against the null hypothesis from random oscillations in real time. Note that the chance of a type I error is still controlled at 5%.

You do make a good point that an A/B test will sometimes see its effect regress over time. We have explicitly separated this out because we feel that detecting a change in the underlying effect size is different from testing whether the effect is non-zero, and different statistical methods are better suited to each. We have built a policy into our framework that monitors for such temporal effects and signals that an A/B test is in a ‘reset’ when we discover them. In our historical database, this happened on about 4% of tests.

I concede that all this is a lot to get across in one graph, but we do feel that it is a good representation of how significance behaves under Stats Engine.

If you would like to read more about the math behind Stats Engine, here is a link to a full technical article: http://pages.optimizely.com/rs/optimizely/images/stats_engin...
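
To make the contrast concrete, here is a minimal simulation sketch of an A/A test, where the true effect is zero. It is not the Stats Engine implementation. It assumes unit-variance normal observations, and for the sequential side it swaps in a mixture sequential probability ratio test with an assumed N(0, tau2) mixing distribution, which is one standard way to build a p-value that stays valid under continuous monitoring. The names `significance_paths`, `false_positive_rates`, and `tau2`, and all parameter values, are hypothetical choices for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
THRESHOLD = 0.95  # significance level at which a winner would be declared


def significance_paths(n_visitors, tau2, rng):
    """Simulate one A/A test (true effect is zero) and return two paths.

    Each visitor contributes a N(0, 1) difference between variation and
    baseline (an assumption of this sketch). Returns (classical, sequential):
    the significance shown after every visitor under each method.
    """
    total = 0.0
    always_valid_p = 1.0
    classical, sequential = [], []
    for n in range(1, n_visitors + 1):
        total += rng.gauss(0.0, 1.0)

        # Classical fixed-horizon z-test, recomputed at every peek.
        z = total / math.sqrt(n)
        classical.append(1.0 - 2.0 * (1.0 - STD_NORMAL.cdf(abs(z))))

        # Log mixture likelihood ratio against H0: effect = 0, using an
        # assumed N(0, tau2) mixing distribution over the effect.
        log_lr = (tau2 * total**2) / (2.0 * (1.0 + n * tau2)) \
            - 0.5 * math.log(1.0 + n * tau2)
        # The p-value is a running minimum, so significance never decreases.
        always_valid_p = min(always_valid_p, math.exp(-log_lr))
        sequential.append(1.0 - always_valid_p)
    return classical, sequential


def false_positive_rates(n_tests, n_visitors, tau2, seed):
    """Fraction of A/A tests that ever cross THRESHOLD under each method."""
    rng = random.Random(seed)
    classical_hits = sequential_hits = 0
    for _ in range(n_tests):
        classical, sequential = significance_paths(n_visitors, tau2, rng)
        classical_hits += max(classical) >= THRESHOLD
        sequential_hits += sequential[-1] >= THRESHOLD
    return classical_hits / n_tests, sequential_hits / n_tests


classical_rate, sequential_rate = false_positive_rates(
    n_tests=300, n_visitors=500, tau2=0.1, seed=0
)
print(f"classical, peeking after every visitor: {classical_rate:.1%}")
print(f"sequential (mixture SPRT sketch):       {sequential_rate:.1%}")
```

Under these assumptions, the classical path rises and falls as visitors arrive, so stopping the first time it crosses the threshold declares far more false winners than the nominal 5%. The sequential path is a running minimum of the p-value, so it can only stay flat or rise.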
|
> if we instead looked continuously at a classical t-test, is the significance would oscillate near the significance threshold
So there's your answer: the y-axis on the chart has a different, unlabeled meaning for the blue line.
While I have you here, Leo, can you explain why you would want to chart only the accumulated evidence for X? It's meaningless without knowing how much evidence has been accumulated for not-X.