Hacker News
by ryanglasgow 4596 days ago
Interesting read, but I would have to disagree. It's not difficult to reach 90% confidence with a very small sample size:

  - Variation A and B each receive 20 visits
  - Variation A receives 10 clicks while variation B receives 5 clicks
  - The calculator reports 90% confidence for Variation A
  (Source: https://mixpanel.com/labs/split-test-calculator)
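For anyone who wants to check those numbers by hand, here is a minimal Python sketch of a pooled two-proportion z-test on the 10/20 vs 5/20 example. This is an assumption about the kind of test involved: the thread does not say what the Mixpanel calculator actually computes, and `two_proportion_z_test` is just an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, visits_a, clicks_b, visits_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    rate_a = clicks_a / visits_a
    rate_b = clicks_b / visits_b
    # Pool both variations to estimate the click rate under "no difference".
    pooled = (clicks_a + clicks_b) / (visits_a + visits_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# The example above: 20 visits each, 10 clicks for A and 5 clicks for B.
z, p_value = two_proportion_z_test(10, 20, 5, 20)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}, confidence = {1 - p_value:.1%}")
```

Under these assumptions the two-sided p-value comes out just above 0.10, which is close to the 90% figure quoted above.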
Also, I wrote an article titled "Creating Successful Product Flows" that is very relevant to this post: https://medium.com/design-startups/c41ffbce49a1
5 comments

Of course if you are A/B testing something which doubles conversions from 25% to 50% (100% improvement) you'll know quickly. However, if you're looking at something which is better by something more realistic like taking conversions from 5% to 5.5%, you're looking at around 10000 visits each for 90% confidence.
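A rough way to sanity-check that visit count is the usual normal-approximation sample size formula for comparing two proportions. The sketch below is Python with only the standard library; `visits_per_variation` is a made-up name, and the `power` argument is an extra assumption, since the comment only mentions the confidence level.

```python
from math import ceil
from statistics import NormalDist


def visits_per_variation(p1, p2, confidence=0.90, power=0.80):
    """Approximate visits needed per variation for a two-sided test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - (1 - confidence) / 2)  # critical value
    z_beta = norm.inv_cdf(power)                      # 0 when power is 50%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# The scenario above: conversions going from 5% to 5.5% at 90% confidence.
baseline, variant = 0.05, 0.055
for power in (0.5, 0.8):
    n = visits_per_variation(baseline, variant, confidence=0.90, power=power)
    print(f"power {power:.0%}: about {n} visits per variation")
```

With a coin-flip chance of detecting the lift (50% power) this lands near the 10000 figure; asking for 80% power more than doubles it.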
A startup isn't looking to make tiny 0.5% incremental improvements, so I don't see how this is relevant. Companies looking to grow a small user base are making significant changes, seeking significant improvements.
Your average well-crafted sales page on the internet has a conversion rate of 2.5%. A 0.5% increment is a HUGE difference. You're lucky if you get a 0.2% increment after an extensive A/B test.
Two things here. First, 90% confidence isn't great; I look for 99% confidence in running tests. Second, this assumes there is a lot of stuff you can test that produces 2x gains, when in reality the number of things that do that is very small.

It's fair to A/B test things you expect to produce high-leverage changes. That was actually part of the point of the article: no small tests. Focus here first; consumer psych helps you figure out where these opportunities are.

Once you get through these big opportunities, though, even respectable gains (e.g. 10%) take a lot of traffic to measure. For example, seeing a 10% relative gain on a 50% conversion rate (50% to 55%) takes around 2500-3000 visits to A/B test at 99% confidence. Seeing a 10% gain on a 10% conversion rate at 99% confidence takes about 10 times more traffic than that.

> Two things here. First, 90% confidence isn't great; I look for 99% confidence in running tests.

Why? Why are you so worried about controlling false positives that you're willing to eat a whole bunch of false negatives?*

You're not administering expensive drugs to cancer patients, you're designing a website! If you mistakenly think that green buttons perform better than blue buttons when the actual truth is the null hypothesis that they perform the same, that's not the end of the world.

* and I do mean a whole bunch; in that scenario, moving from alpha=10% to alpha=1% cuts your power (the chance of detecting a real difference) by something like 3x, so the false-negative rate climbs from about 50% to about 84%. The power calculations:

    R> power.prop.test(n=20, p1=0.5, p2=0.25, sig.level=0.10)
    ...
              power = 0.4951
    ...
    R>
    R> power.prop.test(n=20, p1=0.5, p2=0.25, sig.level=0.01)
    ...
              power = 0.1646
    ...
    R>
    R> 0.4951/0.1646
    [1] 3.008
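
The same numbers can be reproduced outside R. Below is a Python sketch of the normal-approximation formula that `power.prop.test` appears to use for a two-sided test. It is my reading of that formula, not an official port, and `prop_test_power` is a made-up name. It also prints the false-negative rate (1 - power) directly.

```python
from math import sqrt
from statistics import NormalDist


def prop_test_power(n, p1, p2, sig_level):
    """Approximate power of a two-sided two-proportion test, n per group."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - sig_level / 2)
    p_bar = (p1 + p2) / 2
    # Standard deviation terms under the null (pooled) and the alternative.
    sd_null = sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return norm.cdf((sqrt(n) * abs(p1 - p2) - z_crit * sd_null) / sd_alt)


# Same inputs as the R calls above: 20 visits per variation, 50% vs 25%.
for sig_level in (0.10, 0.01):
    power = prop_test_power(n=20, p1=0.5, p2=0.25, sig_level=sig_level)
    print(f"alpha = {sig_level}: power = {power:.4f}, "
          f"false-negative rate = {1 - power:.4f}")
```
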
There will be times when you make a change to a page and the difference in reception between the two pages is as stark as the situation you described above where, out of 20 visits, one page does twice as well. But most often there is a very minor difference between the click rate of the two pages, like less than 1%. In that case, you need a much larger sample size.

And even if you do get lucky and get a test like the one you described above, chances are, you want to continue to revise the page and make more subtle changes which will mean you need a much larger sample size even to reach the low bar of 90% confidence.

Can someone with expertise comment on this? I once worked in a company where the founders thought that small samples were adequate. I thought that the calculators were misleading with such small sample sizes, even though they gave "high confidence".

But that was only based on my intuition, not math, and I've never seen anyone give a good discussion of whether "90% confidence" is as definitive as it sounds in the context of a very small sample.

It's a bit awkward to give a full answer to this, but this is to the best of my understanding and explained as simply as is reasonable:

A small sample has less statistical 'power' to identify significant differences where they exist. Put another way, a large sample is more likely to give a true significant result than a small sample.

But, if you do see 10% significance(/90% confidence) in a small sample, this is just as good as 10% significance in a large sample. Although the cutoff point will be rougher in a smaller sample, it's good standard practice to round conservatively to account for this.

10% is unlikely to be considered a good result for statistics in either case - you can engineer a result by doing 10 tests on nothing and there's a danger you would have unknowingly or unconsciously done this, maybe (for example) by not deciding the sample size in advance. However, there's also presumably strong enough evidence against a harmful difference that you aren't likely to lose anything by following these results.

It can be a good idea to do numerous small investigative tests as justification for bigger tests; relying on lots of small tests alone requires accounting for multiple testing (e.g. a Bonferroni correction).
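
To put numbers on the multiple-testing point, here is a small Python sketch. It assumes the tests are independent and that there is no real effect in any of them; the function names are illustrative only.

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / num_tests


# Ten tests "on nothing", each judged at 10% significance.
num_tests, alpha = 10, 0.10
uncorrected = familywise_error_rate(alpha, num_tests)
per_test = bonferroni_threshold(alpha, num_tests)
corrected = familywise_error_rate(per_test, num_tests)
print(f"uncorrected: {uncorrected:.1%}, "
      f"Bonferroni per-test alpha: {per_test:.3f}, corrected: {corrected:.1%}")
```

Without a correction, ten null tests at 10% significance give roughly a 65% chance of at least one spurious "win"; testing each at 1% instead brings the overall rate back under 10%.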

"But, if you do see 10% significance(/90% confidence) in a small sample, this is just as good as 10% significance in a large sample". That is not true, strictly speaking. You are assuming that the small sample describes the underlying distribution well. But this may not be the case due to non-normality of the distribution itself or potential biases.
Cool point and I agree.

The sample has to represent the population; that's fundamental. If the sample is so small that it can't characterise the population distribution, then you have a problem anyway. If you're measuring events that happen 1% of the time (or 99% of the time), a sample of 100 is not nearly enough.
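
A quick Python sketch of why 100 observations is thin for a 1% event, assuming independent observations (a plain binomial model; `rare_event_summary` is an illustrative name):

```python
from math import sqrt


def rare_event_summary(rate, sample_size):
    """Expected count, its standard deviation, and P(no events at all)."""
    expected = sample_size * rate
    std_dev = sqrt(sample_size * rate * (1 - rate))
    prob_none = (1 - rate) ** sample_size
    return expected, std_dev, prob_none


expected, std_dev, prob_none = rare_event_summary(rate=0.01, sample_size=100)
print(f"expected events: {expected:.1f} (sd {std_dev:.2f}), "
      f"chance of seeing none: {prob_none:.1%}")
```

You expect a single event, with a standard deviation about as large as the count itself, and more than a third of such samples contain no events at all.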

If you chose an appropriate non-parametric test to cover an unknown distribution with a small sample, it might well have zero power (it would be impossible to get a significant result).

There's no such thing as a "small" or "large" sample size, per se. If you're doing it rigorously, you need to fix both your confidence level (e.g., 95%) and the effect size you expect to see (e.g., a 50% lift in metric X relative to your control). You can then do some simple math which will tell you what sample size you need so that a real 50% lift in metric X would be distinguishable from chance at that confidence level. Finally, you run the test until you've sampled that many users and stop the test. If there's a winning variant and it's statistically significant, congrats! If not, go back to square one.

The larger the effect size, the smaller your sample size can be before you reach that conclusion.
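
One common normal-approximation formula makes that relationship explicit. Assuming a two-sided test at significance level $\alpha$ with power $1-\beta$ (the power target is an extra assumption, not something stated above), and baseline and variant rates $p_1$ and $p_2$, the visits needed per variation are roughly:

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\, p_1(1-p_1) + p_2(1-p_2) \,\right]}{(p_1 - p_2)^2}
```

Because the difference is squared in the denominator, halving the effect you want to detect roughly quadruples the sample you need.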

Most folks don't fix the desired effect size and instead just create a bunch of variants, start the A/B test, wait for the A/B testing framework to shout "statistically significant!", and then declare a winning variant. If the sample size seems "too small" they might not feel comfortable declaring a winner, so they perfunctorily "get a few more samples." Neither of these is rigorous, so it's a bit pointless to debate about which one is "better."
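
A small simulation shows why "wait until it shouts significant" is a problem. This Python sketch runs A/A tests (both variants identical) and compares stopping at the first significant look against checking once at a pre-planned sample size. Everything here is an assumption for illustration: the 20% base rate, 10 looks of 100 visits each, the pooled z-test, and the function names.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(clicks_a, clicks_b, visits, alpha):
    """Two-sided pooled z-test with the same number of visits per variant."""
    pooled = (clicks_a + clicks_b) / (2 * visits)
    std_err = sqrt(pooled * (1 - pooled) * 2 / visits)
    if std_err == 0:
        return False
    z = (clicks_a - clicks_b) / visits / std_err
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate(rate=0.2, looks=10, batch=100, alpha=0.05, runs=1000, seed=1):
    """Return (false-positive rate when peeking, rate with one fixed look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        clicks_a = clicks_b = 0
        flagged_at_any_look = False
        for look in range(1, looks + 1):
            clicks_a += sum(rng.random() < rate for _ in range(batch))
            clicks_b += sum(rng.random() < rate for _ in range(batch))
            significant = is_significant(clicks_a, clicks_b, look * batch, alpha)
            flagged_at_any_look = flagged_at_any_look or significant
        peeking_hits += flagged_at_any_look
        fixed_hits += significant  # result at the final, pre-planned look only
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate()
print(f"false positives when peeking: {peeking_rate:.1%}, "
      f"with a fixed sample size: {fixed_rate:.1%}")
```

Since there is no real difference between the variants, every "significant" result is a false positive. Checking after every batch gives the test many chances to cross the threshold, so the peeking rate ends up well above the nominal 5%.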

Small sample sizes are misleading. You probably need at least 100 data points for reasonable significance, but if your data is skewed or has fat tails, then you most likely need many more than that.
> It's not difficult to reach 90% confidence with a very small sample size:

I think the difficulty in reaching 90% confidence is in designing a challenger that is THAT much better than the original (i.e. 10 vs 5). Most split tests are shots in the dark. You'll basically need a design or copy that is doing pretty badly and a challenger that is a lot better (but not so obviously good that you would have used it in the first place).