by loup-vaillant 4536 days ago
> But having a small sample size doesn't make it any more likely to find a false positive.

It does. Try testing a die for loading. Let's say your prior probability that the die is loaded is 50%, because this is a really shady place you're gambling in. You further know (based on the game you're playing) that if your die is loaded, it will land with these frequencies:

  1: 1/3 of the time.
  2: 1/6 of the time.
  3: 1/6 of the time.
  4: 1/6 of the time.
  5: 1/6 of the time.
  6: almost never
Now, you will throw the die on the table a number of times to test it for loading. Each throw will give you some evidence. If I've got my calculations correct, landing a 6 nearly guarantees the die isn't loaded, landing a 1 gives you 1 bit of evidence that it's loaded, and landing anything else doesn't tell you anything.
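The per-throw evidence can be checked with a few lines of Python. This is only a sketch: the names `LOADED`, `FAIR`, `evidence_bits` and `posterior_loaded` are made up for illustration, and "almost never" is approximated as a probability of exactly 0.

```python
from fractions import Fraction
from math import inf, log2

# Face probabilities from the table above. Assumption: "almost never"
# is treated as exactly 0.
LOADED = {1: Fraction(1, 3), 2: Fraction(1, 6), 3: Fraction(1, 6),
          4: Fraction(1, 6), 5: Fraction(1, 6), 6: Fraction(0)}
FAIR = {face: Fraction(1, 6) for face in range(1, 7)}


def evidence_bits(face):
    """log2 of the likelihood ratio P(face | loaded) / P(face | fair)."""
    ratio = LOADED[face] / FAIR[face]
    return log2(ratio) if ratio > 0 else -inf


def posterior_loaded(face, prior=Fraction(1, 2)):
    """Posterior probability of 'loaded' after a single throw (Bayes' rule)."""
    joint_loaded = prior * LOADED[face]
    joint_fair = (1 - prior) * FAIR[face]
    return joint_loaded / (joint_loaded + joint_fair)


for face in range(1, 7):
    print(face, evidence_bits(face), posterior_loaded(face))
```

Positive bits favour "loaded", zero bits leave the 50% prior untouched, and minus infinity rules the loaded hypothesis out.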

Now what is the probability of a false positive? Well… With only one throw, a genuine die will land on 1 one time out of six, giving you a posterior probability distribution of 2/3 loaded, 1/3 genuine (this is as close as you will get to a false positive).

With 2 throws of a genuine die, it's a bit more complicated (the order of the two throws doesn't matter):

  1    , 1    :  1/36 : loaded with 80% probability
  1    , [2-5]:  8/36 : loaded with 67% probability
  6    , [1-6]: 11/36 : definitely genuine
  [2-5], [2-5]: 16/36 : no evidence
And so on, as you throw the die over and over again. I'll spare you the calculations, but the simple thing is, a genuine die will get more and more chances to eventually land a 6, making the "definitely genuine" observation more and more probable (1 - (5/6)^number_of_throws), and the false positives less and less believable.
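For anyone who does want the calculations, here is a brute-force sketch that enumerates every sequence of throws of a genuine die and groups the sequences by the posterior they lead to. The function name `posterior_distribution` is hypothetical, and it again assumes that a loaded die never lands on 6.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Index 0..5 stands for faces 1..6. Assumption: loaded die never shows a 6.
LOADED = [Fraction(1, 3)] + [Fraction(1, 6)] * 4 + [Fraction(0)]
FAIR = [Fraction(1, 6)] * 6


def posterior_distribution(n_throws, prior=Fraction(1, 2)):
    """Map each reachable posterior P(loaded) to the probability that a
    genuine die produces it in n_throws throws."""
    dist = defaultdict(Fraction)
    for outcome in product(range(6), repeat=n_throws):
        like_loaded = like_fair = Fraction(1)
        for face in outcome:
            like_loaded *= LOADED[face]
            like_fair *= FAIR[face]
        joint_loaded = prior * like_loaded
        joint_fair = (1 - prior) * like_fair
        posterior = joint_loaded / (joint_loaded + joint_fair)
        dist[posterior] += like_fair  # chance of this sequence if the die is genuine
    return dict(dist)


two = posterior_distribution(2)
for posterior, prob in sorted(two.items()):
    print(f"P(loaded) = {float(posterior):.2f} with probability {prob}")
```

The probabilities are printed as reduced fractions, so 8/36 shows up as 2/9 and 16/36 as 4/9. The enumeration grows as 6^n, so this is only practical for a handful of throws.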

Okay, this is a contrived example. But sufficiently large sample sizes do indeed reduce the risk of false positives. It's just that some results are so clear-cut that they don't need large sample sizes to reach a conclusion reliably.

1 comments

You're getting a "false positive" with the method you've chosen, but it's not a method that would be accepted in a scientific paper as evidence for an experimental effect. Maybe your method is more appropriate for, say, a machine learning context, but it's not what would be used in a paper like this.

First, the statistical tests used for these experiments don't make use of Bayesian stats, so the prior 50%-loaded probability simply isn't factored in. The standard is to use null-hypothesis testing, which asks, roughly: if the null hypothesis is true (that is, if there is no actual difference between the populations, experimental groups A and B for example), what is the probability that you'd see a pattern like the one observed in the data? And the tests take sample size into account in calculating this probability.

If you throw the die once, the test that you'd use here (Chi-square) would _never_ give you a false positive, that is, a p-value below .05. With small samples, there is too little power to get the requisite p-value. (And I'll note that Chi-square is one of the tests used in these papers.)
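The single-throw claim can be checked by hand. Below is a minimal sketch that computes Pearson's chi-square statistic for each possible single throw and compares it with the standard table value for 5 degrees of freedom at the .05 level (about 11.07). The helper name `chi_square_statistic` is made up for illustration, and note that the chi-square approximation is not considered reliable at expected counts this small anyway.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Standard table value: chi-square critical value, 5 degrees of freedom, alpha = .05.
CRITICAL_DF5_P05 = 11.07

n_throws = 1
expected = [n_throws / 6] * 6  # a fair die under the null hypothesis

for face in range(6):
    observed = [0] * 6
    observed[face] = 1  # the single throw landed on this face
    stat = chi_square_statistic(observed, expected)
    print(face + 1, round(stat, 3), "significant" if stat > CRITICAL_DF5_P05 else "not significant")
```

Whatever face comes up, the statistic is the same (25/6 from the observed cell plus 5/6 from the five empty ones), so it can never cross the threshold.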

There's a whole other debate about whether p-values and null hypothesis tests are the right thing to use, whether the standard 0.05 threshold p-value is small enough, whether Bayesian stats should be used, etc. These are legitimate issues. But they're separate from the claim that small samples will increase the likelihood of a false positive.

Standard statistics are erroneous. Bayesian statistics are correct. End of story.

(I know of the debates. As far as I'm concerned, Bayesians have won by an overwhelming margin. The only advantage of Frequentist statistics is their relative ease of use. But in the search for truth, you just can't escape Probability Theory. Period. My method wouldn't be accepted in a paper? Then fuck the papers. I'm not trying to get published, I'm trying to get to the truth.)

I don't have the proof nailed down, but based on the examples I can come up with, I'm extremely confident that as long as you use probability theory correctly, small sample sizes do increase the chance of false positives. On the other hand, those false positives will be weaker than the exceptional false positive you might get from larger sample sizes. (Imagine I throw the die 30 times and get zero 6s and ten 1s. It's very rare, but it would make me all the more confident the die is loaded.) If you use that crappy outdated Frequentist junk, however, all bets are off.
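To put rough numbers on the 30-throw example, here is a sketch that reuses the loaded-die frequencies from the first comment (1 a third of the time, 6 never, which is an approximation of "almost never") and a 50% prior. The variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

ones, sixes, others = 10, 0, 20  # counts from the example; "others" = faces 2 to 5
n = ones + sixes + others

# Probability of exactly these counts with a genuine die. With zero sixes,
# the multinomial coefficient reduces to a single binomial coefficient.
p_genuine = comb(n, ones) * Fraction(1, 6) ** ones * Fraction(4, 6) ** others

# Same counts with the loaded die (1 comes up 1/3 of the time, 6 never).
p_loaded = comb(n, ones) * Fraction(1, 3) ** ones * Fraction(4, 6) ** others

prior = Fraction(1, 2)
posterior = prior * p_loaded / (prior * p_loaded + (1 - prior) * p_genuine)

print(float(p_genuine))  # how often a genuine die produces this result
print(posterior)         # posterior probability that the die is loaded
```

Each 1 doubles the odds in favour of "loaded" and the other faces leave them unchanged, so ten 1s give odds of 2^10 to 1.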

---

Note however that in a sense, you are correct: by conservation of expected evidence, the probability-weighted average of the evidence you expect is exactly zero. If it were not, you would already have shifted your belief to the point of equilibrium. This means that if you expect lots of weak evidence in one direction, you also expect a small amount of very strong evidence in the other.

I'm not sure this is what you were getting at, though.
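The conservation-of-expected-evidence point can be illustrated with the die from the first comment. The sketch below averages the posterior over every face, weighting each face by how likely you currently think it is (a 50/50 mix of the two hypotheses), and the average comes back to the prior. It measures belief as a probability rather than in bits, and the names are hypothetical.

```python
from fractions import Fraction

# Index 0..5 stands for faces 1..6. Assumption: loaded die never shows a 6.
LOADED = [Fraction(1, 3)] + [Fraction(1, 6)] * 4 + [Fraction(0)]
FAIR = [Fraction(1, 6)] * 6
prior = Fraction(1, 2)

expected_posterior = Fraction(0)
for face in range(6):
    # How likely this face is under your current (mixed) belief.
    p_face = prior * LOADED[face] + (1 - prior) * FAIR[face]
    # Where your belief would move if you saw it.
    posterior = prior * LOADED[face] / p_face
    expected_posterior += p_face * posterior
    print(face + 1, p_face, posterior)

print(expected_posterior)
```

A 1 nudges the belief up to 2/3 fairly often, a 6 drops it to 0 less often, and the two effects cancel exactly.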

---

When we do null-hypothesis testing, we do assume a prior: using a smaller p-value threshold means we're more skeptical of the competing hypothesis, that is, we have a stronger prior belief that it is false. But we don't speak the word "prior", so we can pat ourselves on the back for our "objectivity", and scold the Bayesian for his "subjectivity". Priors, what arrogance. Who is he to believe so and so in the first place? We do science, not faith.

Only we're blind to our own priors.