by graeme 4597 days ago
Can someone with expertise comment on this? I once worked in a company where the founders thought that small samples were adequate. I thought that the calculators were misleading with such small sample sizes, even though they gave "high confidence".

But that was only based on my intuition, not math, and I've never seen anyone give a good discussion of whether "90% confidence" is as definitive as it sounds in the context of a very small sample.

3 comments

It's a bit awkward to give a full answer to this, but this is to the best of my understanding and explained as simply as is reasonable:

A small sample has less statistical 'power' to identify significant differences where they exist. Put another way, a large sample is more likely to give a true significant result than a small sample.
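To put rough numbers on 'power', here is a minimal sketch using the normal approximation for a two-sided two-proportion z-test. The 10% vs 15% conversion rates and the group sizes are made-up values for illustration, not figures from this thread, and the unpooled standard error is a simplification.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_variant, n_per_group, alpha=0.10):
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation with an unpooled standard error,
    so treat the result as a rough guide rather than an exact value.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se = sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_variant * (1 - p_variant) / n_per_group
    )
    shift = abs(p_variant - p_control) / se
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical rates: 10% control, 15% variant.
for n in (50, 500, 5000):
    print(n, round(approx_power(0.10, 0.15, n), 3))
```

With the same true difference, the chance of detecting it climbs steeply as the groups get bigger, which is all 'power' means here.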

But, if you do see 10% significance (90% confidence) in a small sample, this is just as good as 10% significance in a large sample. Although the cutoff point will be rougher in a smaller sample, it's good standard practice to round conservatively to account for this.

10% is unlikely to be considered a good result for statistics in either case - you can engineer a result by doing 10 tests on nothing and there's a danger you would have unknowingly or unconsciously done this, maybe (for example) by not deciding the sample size in advance. However, there's also presumably strong enough evidence against a harmful difference that you aren't likely to lose anything by following these results.

It can be a good idea to do numerous small investigative tests as justification for bigger tests - relying on lots of small tests alone requires consideration for multiple testing (e.g. Bonferroni correction).
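As a quick sketch of why multiple testing matters: assuming the tests are independent (an assumption made here for simplicity), the chance of at least one false positive across several tests on nothing grows quickly, and a Bonferroni-style threshold pulls it back down. The function names are just illustrative.

```python
def chance_of_any_false_positive(alpha, n_tests):
    """Probability of at least one false positive across independent tests
    when there is no real effect in any of them."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


alpha, n_tests = 0.10, 10
uncorrected = chance_of_any_false_positive(alpha, n_tests)
corrected = chance_of_any_false_positive(
    bonferroni_threshold(alpha, n_tests), n_tests
)
print(f"uncorrected: {uncorrected:.3f}, with Bonferroni: {corrected:.3f}")
```

So ten tests at 10% significance give roughly a two in three chance of a spurious "winner", while testing each at 1% keeps the overall rate under 10%.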

"But, if you do see 10% significance(/90% confidence) in a small sample, this is just as good as 10% significance in a large sample". That is not true, strictly speaking. You are assuming that the small sample describes the underlying distribution well. But this may not be the case, due to non-normality of the distribution itself or potential biases.
Cool point and I agree.

The sample has to represent the population, that's fundamental. If the sample is so small that it can't characterise the population distribution, then you have a problem anyway. If you're measuring events that happen 1% of the time (or 99% of the time), a sample of 100 is not nearly enough.
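To see why 100 is not nearly enough for a 1% event, here is a small sketch using the binomial distribution, assuming independent observations with a fixed 1% rate.

```python
from math import comb


def prob_k_events(k, n, p):
    """Binomial probability of seeing exactly k events in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 100, 0.01
for k in range(4):
    print(k, round(prob_k_events(k, n, p), 3))
```

The expected count is just one event, and a sizeable share of such samples contain no events at all, so the sample tells you very little about the rate.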

If you chose an appropriate non-parametric test to cover an unknown distribution with a small sample, it might have zero power (impossible to give a significant result).
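One way to see the zero-power case, taking an exact rank-sum (Mann-Whitney) test as the example non-parametric test (my choice of test, not something specified above) and assuming no ties: the most extreme possible outcome is every value in one group ranking below every value in the other, and that fixes the smallest p-value the test can ever report.

```python
from math import comb


def min_two_sided_p(n_a, n_b):
    """Smallest two-sided p-value an exact rank-sum test can return
    for groups of size n_a and n_b, assuming no tied values.

    Only one of the comb(n_a + n_b, n_a) possible rank arrangements is
    the most extreme in each direction, hence 2 / comb(...).
    """
    return min(1.0, 2 / comb(n_a + n_b, n_a))


for n in (3, 4, 5):
    print(n, round(min_two_sided_p(n, n), 4))
```

With three observations per group the test cannot go below p = 0.1, so at a 5% threshold it can never be significant, however large the real difference is.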

There's no such thing as a "small" or "large" sample size, per se. If you're doing it rigorously, you need to fix both your confidence level (e.g., 95%) and the effect size you expect to see (e.g., a 50% lift in metric X relative to your control). You can then do some simple math which will tell you what sample size you need to detect a 50% lift in metric X with only a 5% chance of a false positive. Finally, you run the test until you've sampled that many users and stop the test. If there's a winning variant and it's statistically significant, congrats! If not, go back to square one.

The larger the effect size, the smaller your sample size can be before you reach that conclusion.
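The "simple math" could look something like the sketch below, which uses the standard normal-approximation formula for comparing two proportions. The 10% baseline rate and the 80% power target are assumptions added for illustration; neither is specified above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in each arm to detect a relative lift
    in a conversion rate with a two-sided two-proportion z-test."""
    z = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = z.inv_cdf(1 - alpha / 2)  # threshold for the confidence level
    z_beta = z.inv_cdf(power)           # extra margin for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 10% baseline conversion rate.
for lift in (0.50, 0.20, 0.10):
    print(f"{lift:.0%} lift -> {sample_size_per_group(0.10, lift)} per group")
```

The required size scales with one over the squared difference, so halving the lift you want to detect roughly quadruples the sample you need.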

Most folks don't fix the desired effect size and instead just create a bunch of variants, start the A/B test, wait for the A/B testing framework to shout "statistically significant!", and then declare a winning variant. If the sample size seems "too small" they might not feel comfortable declaring a winner, so they perfunctorily "get a few more samples." Neither of these is rigorous, so it's a bit pointless to debate about which one is "better."
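To illustrate why waiting for the framework to shout is not rigorous, here is a rough simulation of an A/A test (both arms identical, so any "winner" is a false positive). It compares checking once at a pre-set sample size against peeking every 25 users. All of the parameters (20% conversion rate, 500 users per arm, peek interval) are made up for the sketch.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, z_crit):
    """Two-sided pooled two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > z_crit


def false_positive_rates(n_sims=1000, n_max=500, peek_every=25,
                         rate=0.2, alpha=0.05, seed=0):
    """Return (rate with peeking, rate with a single final check)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        for n in range(1, n_max + 1):
            # Both arms share the same true conversion rate.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and is_significant(conv_a, conv_b, n, z_crit):
                ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += is_significant(conv_a, conv_b, n_max, z_crit)
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking, fixed = false_positive_rates()
print(f"peeking: {peeking:.3f}, fixed sample size: {fixed:.3f}")
```

The single final check should land near the nominal 5%, while stopping at the first significant peek declares a winner far more often even though nothing differs between the arms.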

Small sample sizes are misleading. You probably need at least 100 data points for reasonable significance, but if your data is skewed or has fat tails, then most likely much more than that.