by pmiller2 4275 days ago
The red flag here for me was that Optimizely encourages you to stop the test as soon as it "reaches significance." You shouldn't do that. What you should do is precalculate a sample size based on the statistical power you need, which involves deciding on your tolerance for the probability of making an error and on the minimum effect size you need to detect. Then, you run the test to completion and crunch the numbers afterward. This helps prevent the scenario where your page tests 18% better than itself, by minimizing the probability that your "results" are just a consequence of a streak of positive results in one branch of the test.

I was also disturbed that the effect size wasn't taken into account in the sample size selection. You need to know this before you do any type of statistical test. Otherwise, you are likely to get "positive" results that just don't mean anything.
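To make the "precalculate a sample size" step concrete, here is a rough sketch using the standard normal-approximation formula for comparing two proportions. The function name, its parameters, and the example numbers are made up for illustration; this is not Optimizely's calculator.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_base, min_rel_lift, alpha=0.05, power=0.80, two_sided=True):
    """Visitors needed in EACH arm of a two-proportion z-test.

    p_base       -- assumed baseline conversion rate (e.g. 0.10)
    min_rel_lift -- smallest relative lift worth detecting (0.10 means +10%)
    alpha        -- tolerated false-positive rate
    power        -- desired probability of detecting a lift of that size
    """
    p_var = p_base * (1 + min_rel_lift)
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    p_bar = (p_base + p_var) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_var * (1 - p_var))
    ) ** 2
    return ceil(numerator / (p_var - p_base) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline, want to detect a 10% relative lift.
    print(sample_size_per_arm(0.10, 0.10))
```

The point is that the minimum effect size is an input: halve the lift you want to detect and the required sample roughly quadruples, which is why you have to pick it before the test starts.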

OTOH, I wasn't too concerned that the test was a one-tailed test. Honestly, in a website A/B test, all I really am concerned about is whether my new page is better than the old page. A one-tailed test tells you that. It might be interesting to run two-tailed tests just so you can get an idea what not to do, but for this use I think a one-tailed test is fine. It's not like you're testing drugs, where finding any effect, either positive or negative, can be valuable.

I should also note that I only really know enough about statistics to not shoot myself in the foot in a big, obvious way. You should get a real stats person to work on this stuff if your livelihood depends on it.


Hi pmiller, Dan from Optimizely here. Thanks for your thoughtful response. This is a really important issue for us, so I wanted to set the record straight on a couple of points:

#1 - “Optimizely encourages you to stop the test as soon as it reaches ‘statistical significance.’” - This actually isn’t true. We recommend you calculate your sample size before you start your test using a statistical significance calculator and wait until you reach that sample size before stopping your test. We wrote a detailed article about how long to run a test, here: https://help.optimizely.com/hc/en-us/articles/200133789-How-...

We also have a sample size calculator you can use, here: https://www.optimizely.com/resources/sample-size-calculator

#2 - Optimizely uses a one-tailed test, rather than a 2-tailed test. - This is a point the article makes and it came up in our customer community a few weeks ago. One of our statisticians wrote a detailed reply, and here’s the TL;DR:

- Optimizely actually uses two 1-tailed tests, not one.

- There is no mathematical difference between a 2-tailed test at 95% confidence and two 1-tailed tests at 97.5% confidence.

- There is a difference in the way you describe error, and we believe we define error in a way that is most natural within the context of A/B testing.

- You can achieve the same result as a 2-tailed test at 95% confidence in Optimizely by requiring the Chance to Beat Baseline to exceed 97.5%.

- We’re working on some exciting enhancements to our methodologies to make results even easier to interpret and more meaningfully actionable for those with no formal Statistics background. Stay tuned!

Here’s the full response if you’re interested in reading more: http://community.optimizely.com/t5/Strategy-Culture/Let-s-ta...
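The equivalence claimed in the bullets above (a 2-tailed test at 95% versus two 1-tailed tests at 97.5%) can be checked with a small sketch. It uses a pooled two-proportion z-test, which is an assumption for illustration and not necessarily the statistic Optimizely computes; all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; positive when B converts better."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def p_values(z):
    """Return (p for 'B is better', p for 'B is worse', two-tailed p)."""
    cdf = NormalDist().cdf
    p_better = 1 - cdf(z)
    p_worse = cdf(z)
    return p_better, p_worse, 2 * min(p_better, p_worse)


def decisions_agree(z, alpha=0.05):
    """True if one 2-tailed test at alpha matches two 1-tailed tests at alpha/2."""
    p_better, p_worse, p_two = p_values(z)
    two_tailed_rejects = p_two < alpha
    either_one_tailed_rejects = min(p_better, p_worse) < alpha / 2
    return two_tailed_rejects == either_one_tailed_rejects


if __name__ == "__main__":
    z = z_stat(100, 1000, 130, 1000)  # made-up counts
    print(z, p_values(z), decisions_agree(z))
```

The two procedures reject in exactly the same cases; what differs is how the error rate is described (2.5% per direction versus 5% overall).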

Overall I think it’s great that we’re having this conversation in a public forum because it draws attention to the fact that statistics matter in interpreting test results accurately. All too often, I see people running A/B tests without thinking about how to ensure their results are statistically valid.

Dan

Thanks for replying. I agree with all the points you mention your statistician covered, but you should make sure your users know what kind of test you're using. The only reason I say this is because this article gives me the impression that you were using a single one-tailed test (which, as I said in my post, is a perfectly acceptable thing to do in the context of web site A/B testing).

But, as far as "Optimizely encourages you to stop the test as soon as it reaches 'statistical significance,'" I'm not saying your user documentation or anything encourages people to stop tests early. I'm saying (and this is based only on the article as I've never used Optimizely) that your platform is psychologically encouraging users to stop tests early. E.g. from the article:

    Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off.

    <image with a green check mark saying "Variation 1 is beating Variation 2 by 18.1%">

    But most tests should run longer and in many cases it’s likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more.

I am aware of literature in experimental design that talks about criteria for stopping an experiment before its designed conclusion. Such things are useful in, say, medical research, where if you see a very strong positive or negative result early on, you want to have that safety valve to either get the drug/treatment to market more quickly or to avoid hurting people unnecessarily.

Unless you've built that analysis into when you display your "success message" that "Variation 1 is beating Variation 2 by 18.1%," I'd argue that you're doing users a disservice. When I see that message, I want to celebrate, declare victory, and stop the test; and that's not what you should encourage people to do unless it's statistically sound to do so.

The other thing in the article that led me to this position is that you display "conversion rate over time" as a time series graph. Again, if I see that and I notice one variation is outperforming the other, what I want to do is declare victory and stop the test. That might not be mathematically/statistically warranted.
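A quick way to see why stopping at the first "significant" reading is a problem is to simulate A/A tests, where both arms have the same true conversion rate, and compare checking at every look against checking once at the planned sample size. This is only a sketch: the conversion rate, sample sizes, number of looks, and function names are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def simulate(runs=400, n_per_arm=2000, look_every=100, rate=0.10, seed=1):
    """Return (false-positive rate when peeking, rate at the fixed horizon)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        saw_significance = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate  # both arms share the same true rate
            conv_b += rng.random() < rate
            if i % look_every == 0 and significant(conv_a, conv_b, i):
                saw_significance = True  # a peeker would have stopped here
        peeking_hits += saw_significance
        fixed_hits += significant(conv_a, conv_b, n_per_arm)
    return peeking_hits / runs, fixed_hits / runs


if __name__ == "__main__":
    peeking, fixed = simulate()
    print(f"peeking: {peeking:.3f}  fixed horizon: {fixed:.3f}")
```

Since there is no real difference between the arms, every "win" here is a false positive. The fixed-horizon rate should sit near the nominal 5%, while the peeking rate comes out well above it.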

IMO, as a provider of statistical software, you'd do your users a service by not displaying anything about a running experiment by default until it's either finished or you can mathematically say it's safe to stop the trial. Some people will want their pretty graphs and such, so give them a way to see them, but make them expend some effort to do so. Same thing with prematurely ended experiments: don't provide any conclusions based on an incomplete trial. Give users the ability to download the raw data from a prematurely ended experiment, but don't make it easy or the default.

For a second I thought you were Evan Miller who wrote about the exact same thing: http://www.evanmiller.org/how-not-to-run-an-ab-test.html

No, I'm not him, but stuff like fixing a sample size in advance and not stopping tests early without careful analysis are things I learned in my stat classes. This stuff should be stressed in any good intro stats class covering hypothesis testing. (I was a math major in college, so I had all of 2 courses in mathematical statistics and 0 in experimental design. I didn't go to the best school, but my stats teacher was a former industry statistician focusing on quality control.)

"Honestly, in a website A/B test, all I really am concerned about is whether my new page is better than the old page. A one-tailed test tells you that."

No, it's the other way around. A one-tailed test is only usable for testing whether the new design is worse than the old one, because it being better than the old one does not matter as long as it's not worse. If you are testing whether the new design is better, you definitely need to test both tails, or else you may well switch to a worse design than the old one.

More precisely, before you start the test you need to choose a "default" choice. If the default choice is the old version, then it's safe to switch to the new version provided it isn't worse. Apply the converse if your default choice is the new version.

The key point here is that you aren't choosing a testing procedure, you are choosing a decision procedure.

Frequentism rears its ugly head again...

This problem exists with Bayesian techniques also, it's just more obvious how to set up the problem.

Exactly! The problems arise because of the disconnect between what the math is actually saying and what people think the math is saying. Or rather: what people wish it was saying. Frequentist methods give you "if page A performs the same as page B then the likelihood of observing something at least as extreme as this measurement is less than X%". In practice we never want to know this information. What people actually want to know is "given this measurement, the probability of page A being better than page B is X%", so they interpret whatever number comes out of the frequentist method like that...wishful thinking.

Just give them 2 posterior distributions of the conversion rate of page A and page B. It may look more daunting than a single number at first, but it's much easier to interpret than that single number that comes out of hypothesis testing, and, you know, it's the information they actually need to make a decision whether to pick page A or page B.
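As a rough illustration of what "give them 2 posterior distributions" could look like, here is a sketch that models each page's conversion rate with a Beta posterior and estimates the probability that B beats A by sampling. The uniform Beta(1, 1) prior, the function names, and the visitor counts are all assumptions picked for the example.

```python
import random


def posterior_samples(conversions, visitors, rng, prior_a=1.0, prior_b=1.0, draws=20000):
    """Draw from the Beta posterior of a conversion rate.

    prior_a / prior_b are the Beta prior parameters; (1, 1) is a uniform prior.
    """
    return [
        rng.betavariate(prior_a + conversions, prior_b + visitors - conversions)
        for _ in range(draws)
    ]


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data, prior)."""
    rng = random.Random(seed)
    a = posterior_samples(conv_a, n_a, rng, draws=draws)
    b = posterior_samples(conv_b, n_b, rng, draws=draws)
    return sum(y > x for x, y in zip(a, b)) / draws


if __name__ == "__main__":
    # Made-up counts: 100/1000 conversions on A, 130/1000 on B.
    print(prob_b_beats_a(100, 1000, 130, 1000))
```

The number this prints is the kind of statement people usually want ("the probability that B is better"), but it depends on the prior you chose.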

"given this measurement and our prior beliefs, the probability of page A being better than page B is X%"

FTFY ;). I think Bayesian methods add a lot of interpretive power, but I'm not sure that it would help people make a correct interpretation. I suspect that if practitioners are neglecting the difference between a one-sided and two-sided test, they will likely forget (or gloss over) what priors are (and their non-trivial implementation).

I definitely agree that there is a disconnect between the math and its interpretation, though.

Even in the Bayesian case, you need more than 2 posteriors. You need a decision rule. Comparing posteriors is not sufficient.

http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html
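One possible decision rule, sketched here as an assumption and not necessarily the rule the linked post uses, is expected loss: for each choice, estimate how much conversion rate you expect to give up if that choice turns out to be the worse one, and only pick a winner once that loss falls under a threshold you care about. The Beta(1, 1) priors, the threshold value, and the function names are illustrative.

```python
import random


def expected_loss(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Return (loss if we choose A, loss if we choose B) under Beta(1, 1) priors.

    The loss of a choice is the conversion rate given up, on average,
    in the scenarios where the other variant is actually better.
    """
    rng = random.Random(seed)
    loss_a = loss_b = 0.0
    for _ in range(draws):
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        loss_a += max(p_b - p_a, 0.0)
        loss_b += max(p_a - p_b, 0.0)
    return loss_a / draws, loss_b / draws


def decide(conv_a, n_a, conv_b, n_b, threshold=0.001):
    """Pick the variant with the smaller expected loss if it is under threshold."""
    loss_a, loss_b = expected_loss(conv_a, n_a, conv_b, n_b)
    choice, loss = ("A", loss_a) if loss_a <= loss_b else ("B", loss_b)
    return choice if loss < threshold else "keep testing"


if __name__ == "__main__":
    print(decide(100, 1000, 130, 1000))  # made-up counts
    print(decide(10, 100, 11, 100))      # same direction, far less data
```

The threshold plays the role that the minimum effect size plays in the frequentist setup: it is a business decision about how large a mistake you are willing to live with.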

Why not run a two-tailed test and double the alpha? If I'm understanding it correctly, you'll still make the same conclusion at either tail as a one-tailed test, but this way you have both directions covered. I could be missing something, just thinking out loud.