by dsiroker 4275 days ago
Hi pmiller, Dan from Optimizely here. Thanks for your thoughtful response. This is a really important issue for us, so I wanted to set the record straight on a couple of points:

#1 - “Optimizely encourages you to stop the test as soon as it reaches ‘statistical significance.’” - This actually isn’t true. We recommend that you calculate your sample size before you start your test, using a statistical significance calculator, and wait until you reach that sample size before stopping the test. We wrote a detailed article about how long to run a test, here: https://help.optimizely.com/hc/en-us/articles/200133789-How-...

We also have a sample size calculator you can use, here: https://www.optimizely.com/resources/sample-size-calculator
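
For readers who want to see what "calculate your sample size before you start" involves, here is a minimal sketch using the textbook normal-approximation formula for comparing two proportions. It is not the formula behind the linked calculator, and the function name, the 5% baseline, the 20% relative lift, and the 80% power default are all illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-sided two-proportion z-test).

    Hypothetical helper: textbook approximation, not any vendor's method.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate we hope the variation reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed inputs: 5% baseline conversion, hoping to detect a 20% relative lift.
print(sample_size_per_variation(baseline_rate=0.05, relative_lift=0.20))
```

The point of fixing this number up front is that the test runs until each variation has seen that many visitors, whatever the dashboard says in the meantime.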

#2 - Optimizely uses a one-tailed test, rather than a 2-tailed test. - This is a point the article makes and it came up in our customer community a few weeks ago. One of our statisticians wrote a detailed reply, and here’s the TL;DR:

- Optimizely actually uses two 1-tailed tests, not one.

- There is no mathematical difference between a 2-tailed test at 95% confidence and two 1-tailed tests at 97.5% confidence.

- There is a difference in the way you describe error, and we believe we define error in a way that is most natural within the context of A/B testing.

- You can achieve the same result as a 2-tailed test at 95% confidence in Optimizely by requiring the Chance to Beat Baseline to exceed 97.5%.

- We’re working on some exciting enhancements to our methodologies to make results even easier to interpret and more meaningfully actionable for those with no formal Statistics background. Stay tuned!

Here’s the full response if you’re interested in reading more: http://community.optimizely.com/t5/Strategy-Culture/Let-s-ta...
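
The equivalence claimed in the bullets above can be checked numerically. The sketch below assumes a normally distributed test statistic `z` and treats "Chance to Beat Baseline" as the standard normal CDF at `z`. That mapping is an assumption made for illustration, not a description of how Optimizely computes the figure, and the function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def two_tailed_significant(z, confidence=0.95):
    """One 2-tailed test: reject when the two-sided p-value is below 1 - confidence."""
    p_value = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    return p_value < 1 - confidence


def paired_one_tailed_significant(z, confidence=0.975):
    """Two 1-tailed tests: reject when either direction exceeds the confidence level."""
    chance_to_beat_baseline = STANDARD_NORMAL.cdf(z)  # assumed definition
    chance_baseline_wins = 1 - chance_to_beat_baseline
    return max(chance_to_beat_baseline, chance_baseline_wins) > confidence


# The two decision rules agree for every z on a grid from -4.00 to 4.00.
grid = [i / 100 for i in range(-400, 401)]
print(all(two_tailed_significant(z) == paired_one_tailed_significant(z) for z in grid))
```

Both rules reject exactly when |z| exceeds the same critical value (about 1.96), which is why requiring 97.5% in each direction matches a 2-tailed test at 95%.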

Overall I think it’s great that we’re having this conversation in a public forum because it draws attention to the fact that statistics matter in interpreting test results accurately. All too often, I see people running A/B tests without thinking about how to ensure their results are statistically valid.

Dan

1 comments

Thanks for replying. I agree with all the points you mention your statistician covered, but you should make sure your users know what kind of test you're using. I say this only because the article gave me the impression that you were using a single one-tailed test (which, as I said in my post, is a perfectly acceptable thing to do in the context of web site A/B testing).

But as far as "Optimizely encourages you to stop the test as soon as it reaches 'statistical significance'" goes, I'm not saying your user documentation or anything else encourages people to stop tests early. I'm saying (and this is based only on the article, as I've never used Optimizely) that your platform is psychologically encouraging users to stop tests early. E.g. from the article:

    Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off.

    <image with a green check mark saying "Variation 1 is beating Variation 2 by 18.1%">

    But most tests should run longer and in many cases it’s likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more.

I am aware of literature in experimental design that talks about criteria for stopping an experiment before its designed conclusion. Such things are useful in, say, medical research, where if you see a very strong positive or negative result early on, you want to have that safety valve to either get the drug/treatment to market more quickly or to avoid hurting people unnecessarily.

Unless you've built that analysis into when you display your "success message" that "Variation 1 is beating Variation 2 by 18.1%," I'd argue that you're doing users a disservice. When I see that message, I want to celebrate, declare victory, and stop the test; and that's not what you should encourage people to do unless it's statistically sound to do so.
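
A rough simulation illustrates why stopping at the first "significant" reading is risky. The sketch below runs A/A tests, where both arms share the same true conversion rate, and compares how often a difference looks significant at any of several interim peeks with how often it does at the planned end. The function names, the 10% rate, the 2,000 visitors per arm, and the peek interval are all assumptions chosen for illustration; this is not a model of any particular platform.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRITICAL = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold


def z_score(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return 0.0  # no variation observed yet, so nothing to test
    standard_error = sqrt(2 * pooled * (1 - pooled) / n)
    return (conv_b / n - conv_a / n) / standard_error


def simulate_aa_tests(trials=300, visitors=2000, peek_every=100, rate=0.10, seed=1):
    """Return (share significant at any peek, share significant at the planned end)."""
    rng = random.Random(seed)
    flagged_at_any_peek = 0
    flagged_at_end = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        peek_flagged = False
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate  # both arms share the same true rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and abs(z_score(conv_a, conv_b, n)) > Z_CRITICAL:
                peek_flagged = True
        flagged_at_any_peek += peek_flagged
        flagged_at_end += abs(z_score(conv_a, conv_b, visitors)) > Z_CRITICAL
    return flagged_at_any_peek / trials, flagged_at_end / trials


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"false positives when stopping at any peek: {peeking_rate:.1%}")
print(f"false positives at the planned sample size: {fixed_rate:.1%}")
```

Because there is no real difference between the arms, every "win" here is a false positive. The peeking share counts a trial as soon as any interim check crosses the threshold, which is the situation a green "winner" indicator invites when it is shown on a running test.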

The other thing in the article that led me to this position is that you display "conversion rate over time" as a time series graph. Again, if I see that and I notice one variation is outperforming the other, what I want to do is declare victory and stop the test. That might not be mathematically/statistically warranted.

IMO, as a provider of statistical software, you'd do your users a service by not displaying anything about a running experiment by default until it's either finished or you can mathematically say it's safe to stop the trial. Some people will want their pretty graphs and such, so give them a way to see them, but make them expend some effort to do so. The same goes for prematurely ended experiments: don't provide any conclusions based on an incomplete trial. Give users the ability to download the raw data from a prematurely ended experiment, but don't make it easy or the default.