|
|
|
|
|
by pmiller2
4275 days ago
|
|
The red flag here for me was that Optimizely encourages you to stop the test as soon as it "reaches significance." You shouldn't do that. What you should do is precalculate a sample size based on the statistical power you need, which involves determining your tolerance for the probability of making an error and on the minimum effect size you need to detect. Then, you run the test to completion and crunch the numbers afterward. This helps prevent the scenario where your page tests 18% better than itself by minimizing probability that your "results" are just a consequence of a streak of positive results in one branch of the test. I was also disturbed that the effect size was taken into account in the sample size selection. You need to know this before you do any type of statistical test. Otherwise, you are likely to get "positive" results that just don't mean anything. OTOH, I wasn't too concerned that the test was a one-tailed test. Honestly, in a website A/B test, all I really am concerned about is whether my new page is better than the old page. A one-tailed test tells you that. It might be interesting to run two-tailed tests just so you can get an idea what not to do, but for this use I think a one-tailed test is fine. It's not like you're testing drugs, where finding any effect, either positive or negative, can be valuable. I should also note that I only really know enough about statistics to not shoot myself in the foot in a big, obvious way. You should get a real stats person to work on this stuff if your livelihood depends on it. |
|
#1 - “Optimizely encourages you to stop the test as soon as it reaches ‘statistical significance.’” - This actually isn’t true. We recommend you calculate your sample size before you start your test using a statistical significance calculator and waiting until you reach that sample size before stopping your test. We wrote a detailed article about how long to run a test, here: https://help.optimizely.com/hc/en-us/articles/200133789-How-...
We also have a sample size calculator you can use, here: https://www.optimizely.com/resources/sample-size-calculator
#2 - Optimizely uses a one-tailed test, rather than a 2-tailed test. - This is a point the article makes and it came up in our customer community a few weeks ago. One of our statisticians wrote a detailed reply, and here’s the TL;DR:
- Optimizely actually uses two 1-tailed tests, not one.
- There is no mathematical difference between a 2-tailed test at 95% confidence and two 1-tailed tests at 97.5% confidence.
- There is a difference in the way you describe error, and we believe we define error in a way that is most natural within the context of A/B testing.
- You can achieve the same result as a 2-tailed test at 95% confidence in Optimizely by requiring the Chance to Beat Baseline to exceed 97.5%.
- We’re working on some exciting enhancements to our methodologies to make results even easier to interpret and more meaningfully actionable for those with no formal Statistics background. Stay tuned!
Here’s the full response if you’re interested in reading more: http://community.optimizely.com/t5/Strategy-Culture/Let-s-ta...
Overall I think it’s great that we’re having this conversation in a public forum because it draws attention to the fact that statistics matter in interpreting test results accurately. All too often, I see people running A/B tests without thinking about how to ensure their results are statistically valid.
Dan