It's disturbing to me how p < 0.05 is used somewhat unthinkingly as the test for statistical significance simply because it's ubiquitous in science.
It seems to me that if you have even a somewhat popular app, you're gathering enough data that you can afford to use p < 0.001 and avoid a lot of the complexities of statistical analysis that come from p < 0.05. If you don't have enough data to reach p < 0.001, it's probably better to work more on increasing traffic than on getting the piddling gains from A/B testing so early.
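To put a rough number on "can afford", here is a minimal sketch using the textbook normal-approximation sample size formula for comparing two conversion rates. The function name, the 80% power default, and the example rates are my own assumptions for illustration, not anything from a particular tool.

```python
import math
from statistics import NormalDist


def visitors_per_arm(base_rate, variant_rate, alpha, power=0.8):
    """Approximate visitors needed per arm for a two-sided test.

    Uses the standard normal-approximation formula for two proportions.
    The 80% power default is an assumption chosen for illustration.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = base_rate * (1 - base_rate) + variant_rate * (1 - variant_rate)
    effect = base_rate - variant_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)
```

Under these assumptions, moving from alpha = 0.05 to alpha = 0.001 costs a bit more than twice the traffic for the same effect size, e.g. compare `visitors_per_arm(0.10, 0.12, 0.05)` with `visitors_per_arm(0.10, 0.12, 0.001)`.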
People blindly stopping at 0.05 is doubly worrying given that people tend to stop an A/B test as soon as it shows significance. That gives them multiple chances to be wrong. Furthermore if you are getting close to significance very fast, then strong significance is close behind, so why not wait?
That said, if a test has been running for a while and you don't have an answer, it can run for a looong time before it finishes. In my A/B testing tutorial I explored that starting at http://www.elem.com/~btilly/effective-ab-testing/#slide59 (just use the arrow keys to move forwards and backwards through those slides). I found that depending on whether random fluctuations took you in the same direction as the underlying bias or the opposite, there tends to be an order of magnitude difference in how long the test takes to run. Furthermore whichever is leading after many observations is usually really better, and in the worst case is overwhelmingly likely to not be much worse. Therefore there are times when it really is better to declare an answer and move on.
If you wish to formalize this, you could use the strategy used by some medical trials, where they decide in advance what confidence levels will cause them to cut off early after 100 trials, 1,000 trials, 10,000 trials, or to go to (say) 50,000 trials. They then arrange that the sum of the odds of making an early mistake is below some acceptable threshold.
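A minimal sketch of that idea, assuming the simplest possible scheme: split one overall error budget across the planned looks so that the per-look thresholds add up to it. The function names and the weights are hypothetical. Real trials typically use less conservative boundaries (Pocock or O'Brien-Fleming, for instance), so treat this as an illustration of the bookkeeping only.

```python
def spend_alpha(looks, total_alpha=0.05, weights=None):
    """Assign each planned look a p-value threshold; thresholds sum to total_alpha.

    looks:   sample sizes at which you will look, decided in advance
    weights: hypothetical relative share of the error budget per look
             (equal shares if omitted)
    """
    if weights is None:
        weights = [1.0] * len(looks)
    scale = total_alpha / sum(weights)
    return {n: w * scale for n, w in zip(looks, weights)}


def first_stop(p_values_by_look, thresholds):
    """Return the first look whose p-value beats its threshold, else None."""
    for n in sorted(thresholds):
        if p_values_by_look[n] < thresholds[n]:
            return n
    return None
```

For example, `spend_alpha([100, 1000, 10000, 50000], 0.05, weights=[1, 1, 1, 7])` keeps most of the budget for the final look, so an early stop requires much stronger evidence than the final decision does.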
Another common issue he doesn't mention: using observed differences (or observed significance-test values) as the stopping criterion. The common statistical-significance tests don't work if the decision when to stop collecting data is dependent on the observed levels of significance. Instead, you must ahead of time decide how many trials to do, and stick to that decision, or use more complicated significance tests. (This is the "multiple testing" problem.)
For example, it works to flip two coins 50 times each, and then run a statistical-significance test. It does not work to flip two coins 50 times each, run a test; if no significance yet, continue to 100, then 150, etc. until you either find a significant difference or give up. That greatly increases the chance that you'll get a spurious significance, because your stopping is biased in favor of answering "yes": if you found a difference at 50, you don't go on to 100 (where maybe the difference would disappear again), but if you didn't find a difference at 50, you do go on to 100.
Put differently, it's using separate p-values for "what is the chance I could've gotten this result in [50|100|150|...] trials with unweighted coins?" to reject the null hypothesis each time, as if they were independent, but the null hypothesis for the entire series has to be the union, "what is the chance I could've seen this result at any of the 50, 100, 150, or 200, ... stopping points with unweighted coins?", which is higher. Yet that's exactly how many A/B tests are done: you start collecting data, and let the trials run until you find "significant" differences or give up.
(It's possible to set up a series of tests where you choose when to stop based on observed values, but you have to use different statistical machinery than the common significance-tests.)
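The coin example is easy to check by simulation. The sketch below flips two fair coins, so any "significant" difference is spurious by construction, and compares testing once at 50 flips with peeking at 50, 100, 150 and 200. The choice of a two-proportion z-test, the checkpoints, and the function names are my assumptions for illustration.

```python
import math
import random


def two_sided_p(heads_a, heads_b, n):
    """Two-proportion z-test p-value for two coins flipped n times each."""
    pooled = (heads_a + heads_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation at all, so no evidence of a difference
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (heads_a - heads_b) / n / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def false_positive_rates(checkpoints=(50, 100, 150, 200), alpha=0.05,
                         runs=20000, seed=1):
    """Return (rate when testing once at the first checkpoint,
               rate when stopping at the first significant checkpoint)."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(runs):
        a = b = flipped = 0
        significant_at = []
        for n in checkpoints:
            for _ in range(n - flipped):   # flip both fair coins up to n
                a += rng.random() < 0.5
                b += rng.random() < 0.5
            flipped = n
            significant_at.append(two_sided_p(a, b, n) < alpha)
        fixed_hits += significant_at[0]      # decided in advance: test at 50 only
        peeking_hits += any(significant_at)  # stop as soon as anything looks significant
    return fixed_hits / runs, peeking_hits / runs
```

Calling `false_positive_rates()` returns both rates. The fixed-size rate should land near the nominal 5%, while the peeking rate comes out clearly higher, and it keeps growing as you add checkpoints.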
Wait, so people who do A/B tests didn't already do that? It drives me absolutely crazy when I don't have any measure to assess how likely or unlikely it is for some difference to be random.
No, people who do A/B tests have known this for years. It is the wannabes who haven't sat down and figured out the statistics who run into trouble. See http://elem.com/~btilly/effective-ab-testing/ for an OSCON tutorial that I did on the topic a couple of years ago, which includes all the gory statistical detail you could want.
Furthermore I note with interest that 2 of the 3 statistical techniques he named (Student's t test and ANOVA) only apply to cases where the observed variables are themselves normally distributed. Which is not a good description of binary yes/no outcomes. As for the remaining test, it is appropriate to use a chi-square, but statisticians tell us that the g-test is preferable.
The total is indeed nearly normally distributed, but the rate of convergence (particularly in the tails) is not fast enough to avoid having those very sensitive tests give wrong results.
Were it otherwise there would have been no need to develop the chi-square test. It would have been entirely redundant. (It actually is redundant because we have the g-test. But evaluating the chi-square test just involves taking squares, while the g-test involves taking natural logarithms. This made the less accurate chi-square test much easier to do when people didn't have computers to calculate it on. Today we should use the g-test, but few people have heard of it.)
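To make the squares-versus-logarithms point concrete, here is a minimal sketch that computes both statistics for a 2x2 conversion table. The function name and argument layout are my own. It assumes both arms are non-empty and that the data contains at least one conversion and one non-conversion, and it uses no continuity correction.

```python
import math


def chi_square_and_g(conv_a, n_a, conv_b, n_b):
    """Return ((chi2, p), (g, p)) for a 2x2 table, each with 1 degree of freedom."""
    observed = [conv_a, n_a - conv_a, conv_b, n_b - conv_b]
    rate = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate under the null
    expected = [n_a * rate, n_a * (1 - rate), n_b * rate, n_b * (1 - rate)]

    # Pearson chi-square: only squares and divisions.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # G-test: needs natural logarithms; empty cells contribute zero.
    g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    def p_value(stat):
        # Survival function of chi-square with 1 degree of freedom.
        # max() guards against a tiny negative value from rounding.
        return math.erfc(math.sqrt(max(stat, 0.0) / 2))

    return (chi2, p_value(chi2)), (g, p_value(g))
```

With reasonably large cell counts, as in `chi_square_and_g(60, 1000, 40, 1000)`, the two statistics come out close to each other. They drift apart as the expected counts get small, which is where the choice of test matters.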
Ah, right. I spent a while drawing up a proper plot of the likelihood of the difference and the normal approximation of the difference, and saw that the normal had too small a variance. The effect is still pretty credible in the OP, though.
Noprocrast caught out my attempt to edit. The normal variance is too large.
Here's the plot. Black for discretized(n=1000) binomial likelihood, red for normal approximation. The effect is clear, but a t-test won't show it. I'm not familiar with the theory behind the g-test, but there's clearly a lot of room for improvement at these sample sizes.
I was quite surprised to find that the linked website designed to showcase A/B tests doesn't even hint at things like statistical significance or confidence intervals for the improvements.
I think most people do not do this because they do not know it is important or, like me, they do not understand the theory behind it. Nor did I know how to do it in practice.
Many people think they will become millionaires if they follow the style of person X.
Person X is like a trial in which a coin was tossed 10000 times and got 6000 heads.
Since there is no information about the other people, the other trials, many follow the illogical thinking that they will succeed in the same way.