Hacker News
by eloff 3924 days ago
No, this is wrong. With small sample sizes you may get a statistically significant result, but it still might not be a real result and might not be reproducible. This is a major issue in science today and why a lot of studies can't be replicated.

> No, this is wrong. With small sample sizes you may get a statistically significant result, but it still might not be a real result and might not be reproducible. This is a major issue in science today and why a lot of studies can't be replicated.

Reproducibility is indeed a major problem, but looking at statistical significance alone isn't the cure (especially if applied a posteriori).

We should rather look at effect sizes and robust study designs.

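In fact, modern studies aiming for causality often calculate the sample size needed for statistical significance beforehand. It's a standard formula in most textbooks. You need the expected effect size, plus your chosen significance level and statistical power, and can then calculate the sample size needed to detect that effect with the chosen probability. It makes significance likely if the effect is real; it cannot guarantee it.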
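A minimal sketch of that textbook calculation, assuming a two-sided comparison of two group means and the normal approximation. The function name, the defaults (alpha = 0.05, power = 0.8), and the use of Cohen's d as the effect size are my assumptions for illustration, not something stated in the thread:

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate subjects needed per group for a two-sided, two-sample test.

    effect_size is Cohen's d (difference in means / common standard deviation).
    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    which slightly underestimates the n an exact t-test calculation gives.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(d, sample_size_per_group(d))
```

Note how fast the requirement grows as the expected effect shrinks: halving the effect size roughly quadruples the sample size.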

Statistically significant means statistically significant and is independent of sample size. If your p-value is less than 0.01, then there's less than a 1% chance that the pattern you're seeing is due to random fluctuations of the variable itself that you cannot predict.

The problem is that the statistical model (in my field we do a lot of ANOVA and t-tests, along with the occasional chi-square) can only account for what you model. So there could be some kind of systematic error that influences your results in a fashion that is not modeled by the statistics. Having a large-N study makes it harder to have that systematic error (but not impossible - as an example: look at complaints about how much psychological and cognitive science research is only on WEIRD subjects - Western, educated, industrialized, rich, democratic).

The other problem, of course, is that when there is no real effect, one time in a hundred you'll still get a p < 0.01 significant result by chance. Which is a lot in the long run. Worse, you can induce type one errors (false positives) by running hundreds of trials (or testing hundreds of variables) and not accounting for that - just pick the one thing that had significant results on a single test. This approach is unscrupulous, but not unheard of in academic circles where you need to publish tons of work to get promoted.
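To put a number on "a lot in the long run", here is a rough sketch of how quickly the chance of at least one false positive grows with the number of tests. It assumes the tests are independent and that every null hypothesis is true; the function names are made up for this example, and Bonferroni is just one common correction I picked to illustrate "accounting for that":

```python
def familywise_error_rate(num_tests, alpha=0.01):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(num_tests, alpha=0.01):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 500):
        uncorrected = familywise_error_rate(m)
        corrected = familywise_error_rate(m, alpha=bonferroni_alpha(m))
        print(f"{m:4d} tests: uncorrected {uncorrected:.3f}, Bonferroni {corrected:.4f}")
```

With a hundred uncorrected tests at p < 0.01, you are more likely than not to find at least one "significant" result even when nothing is going on.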

> If your p-value is less than 0.01, then there's less than a 1% chance that the pattern you're seeing is due to random fluctuations of the variable itself that you cannot predict.

This is a dangerous misinterpretation of p values, which cannot provide that kind of information. A p value assumes the pattern is due to random fluctuations, and asks how common this kind of fluctuation is.

Typically the chance the result is a random fluctuation is much higher; for examples, see http://www.statisticsdonewrong.com/p-value.html

That's actually a more articulate but redundant codicil to the argument I made in the rest of the post. Multiple tests will result in significance at some alpha, since you just have to test enough times to get a lucky test. There are techniques (outlined in your link) for addressing that, but I think the central point is still cogent.

If you have a test of significance that results in p < 0.01, there's a one percent chance that you're rejecting the null hypothesis due to normally-distributed variation in your data. The base rate fallacy is more about interpreting what that p = 0.01 means, and why systematic bias is important to worry about - if you're testing cancer drugs, you don't want to test them on people who don't have cancer.

> If you have a test of significance that results in p < 0.01, there's a one percent chance that you're rejecting the null hypothesis due to normally-distributed variation in your data.

No, this is absolutely not true. If p < 0.01, then if there is no systematic effect and only normally-distributed variation, you would see an effect at least this large less than 1% of the time. That is, the p value is P(data at least this extreme | null is true), and not P(null is true | data). You cannot invert the conditional.
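A quick simulation makes the direction of the conditional concrete. This is only a sketch under assumptions of my own choosing: every experiment draws pure noise from a standard normal with known variance (so the null is true by construction), and a two-sided z-test is used. The function name and defaults are hypothetical:

```python
import random
from statistics import NormalDist


def null_rejection_rate(num_experiments=20000, n=20, alpha=0.01, seed=0):
    """Fraction of pure-noise experiments that come out 'significant'.

    The null is true in every experiment, so every rejection counted
    here is a false positive.
    """
    rng = random.Random(seed)
    z = NormalDist()
    rejections = 0
    for _ in range(num_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z_stat = (sum(sample) / n) * n ** 0.5  # z-test with known sigma = 1
        p_value = 2 * (1 - z.cdf(abs(z_stat)))
        if p_value < alpha:
            rejections += 1
    return rejections / num_experiments


if __name__ == "__main__":
    print(null_rejection_rate())
```

The rate printed should land close to alpha. That is all the p value promises: how often noise alone produces a result this extreme. It says nothing about how often a significant result you are looking at came from noise.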

In the extreme case, when the null is true for every test, you will get significant results for 5% of them (at a 0.05 threshold). Thus 100% of your statistically significant results are false positives, no matter how small their p values.

Given that we do not know what fraction of the time the null is true, we cannot know the chance that we're rejecting the null falsely. But it is invariably larger than p.
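To see how much the answer depends on that unknown fraction, here is a small sketch applying Bayes' rule. The fraction of true nulls (`prior_null`) and the power are inputs you would have to guess in practice; the function name and the example values are illustrative assumptions, not figures from any study:

```python
def false_positive_share(prior_null, alpha=0.05, power=0.8):
    """P(null is true | result is significant) via Bayes' rule.

    prior_null: fraction of tested hypotheses where the null is true.
    alpha:      significance threshold, P(significant | null true).
    power:      P(significant | null false).
    """
    false_positives = alpha * prior_null
    true_positives = power * (1 - prior_null)
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    for prior in (0.5, 0.9, 0.99, 1.0):
        print(f"null true {prior:.0%} of the time -> "
              f"{false_positive_share(prior):.1%} of significant results are false")
```

When the null is true for every test, the share is 100%, matching the extreme case above. When it is true 90% of the time, about 36% of significant results are false positives at alpha = 0.05 and 80% power, far more than the 5% a naive reading of the threshold suggests.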

This misunderstanding is why scientists routinely overestimate the strength of their evidence and discount the possibility that their results may be flukes.

(Source: I wrote the link provided earlier. Also, the discussion leading to table 1 in this paper is good http://journals.plos.org/plosmedicine/article?id=10.1371/jou...)