| HN Mirror

> It shows a monotone non-decreasing significance because [the value] represents the total amount of accumulated evidence against the null hypothesis.

> if we instead looked continuously at a classical t-test, is the significance would oscillate near the significance threshold

So there's your answer: the y-axis on the chart has an unlabeled different meaning for the blue line.

While I have you here Leo, can you explain why you would want to chart only the accumulated evidence for X? It's meaningless without knowing how much evidence has been accumulated for not X.

One point of clarification, the y-axis on the chart does have the same meaning for both lines. It is 1 minus the chance of committing a type I error. I think you do point out an important nuance that under sequential testing a type I error changes to “ever detecting a significant result on an insignificant test” instead of just at one, predetermined visitor count.

The amount of accumulated evidence for X is exactly a p-value, or a measurement which can tell you if there is enough evidence in the experiment to contradict a hypothesis of “no difference between a baseline and variation.” A high p-value, or low significance tells you there is a lack of evidence to make this claim.

You bring up a very interesting point which is that with sequential testing it is actually possible to also look for evidence of ‘not X’ or that there really is no detectable difference. This works by ‘flipping the hypothesis test on it’s head’ and allows for a mathematical formulation of stopping early for futility. We do not currently offer this in Stats Engine because we believe it’s the less important quantity of the two, but it may be the focus of future research.

Sorry for the delay in responding. HN thinks we are responding too fast to comments over here. =)

You keep using that word "classical." If by "classical" you mean frequentist, then sequential testing is the appropriate frequentist method to have been using all along. If by "classical," you mean "old and established," then sequential testing is still the appropriate method to have been using all along.