Hacker News new | ask | show | jobs
by MrEldritch 2318 days ago
basically, because once you start trying multiple hypotheses on the same dataset, the math used to determine "is this conclusion real, or am I just fooling myself" begins to break down.

The statistical significance threshold usually used is p<0.05, meaning that something is (generally, this is beginning to change since the replication crisis) considered to be a real discovery if it has less than a 1/20 chance of being a false positive under the chosen model.

As soon as you start trying multiple hypotheses, then that 1/20 chance of being a false positive begins to become meaningless. If you can just keep rolling d20s until one of them comes up with a critical hit, then you can easily generate false positives that still look very robust.

This is exactly the sort of bad science - p-hacking, fishing expeditions, and the garden of forking paths - that led to the replication crisis. (And that makes sense, as this paper is from 2013, and predates the widespread discovery of the crisis)

3 comments

The math continues to work out as long as you use the right approach. You have to collect twice as much data, and then set half of it aside at random without examining it. Then you can do whatever perverse p-hacking multi-modeling curve-fitting whatever to the half you kept until you reach a hypothesis, then check it against the half you set aside to recover the statistical significance you lost by using techniques that may have overfit the first half. Unsurprisingly, the math works out because this approach is isomorphic to collecting the first half, studying it to form a hypothesis, then conducting a proper pre-hypothesized experiment to collect the second half. Validation via holdout sets is the same approach used in machine learning and elsewhere to prevent models from overfitting data.
This is true! I was trying to simplify things a bit for a basic explanation, but I fear I oversimplified. I just meant that the generally used math breaks down; if you're aware of the problem, you can correct for it, but very often people don't.
Stating it more plainly, what you wrote was incorrect, and unfairly tarred a statement that was, in fact, correct.
Thanks! For someone that didn't understand why this was considered p-hacking, that made a whole lot of sense.
p<0.05 is also cargo-cult science, and is much more responsible for the replication crisis -- along with biased sampling (pop. 18-22 yo US psych students).

It is also why we see repeated, spurious insistence that anti-depressants don't do anything.

Experiment design is a subtle skill.