Hacker News
by ncmncm 2319 days ago
Why not? Science that insists on hypotheses written down beforehand is cargo-cult science. Observation is the first and most productive science. Double-blind experiments are to cement gains.
6 comments

Basically, because once you start trying multiple hypotheses on the same dataset, the math used to determine "is this conclusion real, or am I just fooling myself" begins to break down.

The statistical significance threshold usually used is p<0.05, meaning that something is (generally; this is beginning to change since the replication crisis) considered to be a real discovery if there is less than a 1/20 chance of seeing a result at least that extreme by chance alone under the chosen model.

As soon as you start trying multiple hypotheses, then that 1/20 chance of being a false positive begins to become meaningless. If you can just keep rolling d20s until one of them comes up with a critical hit, then you can easily generate false positives that still look very robust.
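To put rough numbers on the d20 analogy, here is a minimal sketch. It assumes each hypothesis is an independent roll where a natural 20 counts as a "significant" result; the function names are made up for illustration, and real hypotheses tested on one dataset are usually not independent.

```python
import random

def study_finds_something(num_hypotheses, rng):
    """Roll one d20 per hypothesis; a natural 20 counts as a 'significant' result."""
    return any(rng.randint(1, 20) == 20 for _ in range(num_hypotheses))

def false_positive_rate(num_hypotheses, num_studies=20_000, seed=0):
    """Fraction of pure-noise studies that report at least one 'discovery'."""
    rng = random.Random(seed)
    hits = sum(study_finds_something(num_hypotheses, rng) for _ in range(num_studies))
    return hits / num_studies

if __name__ == "__main__":
    for k in (1, 5, 20):
        simulated = false_positive_rate(k)
        exact = 1 - 0.95 ** k  # chance that at least one of k independent rolls is a 20
        print(f"{k:>2} hypotheses: simulated {simulated:.3f}, exact {exact:.3f}")
```

With a single hypothesis the rate stays near 1/20; with 20 of them, a study of pure noise "finds" something well over half the time.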

This is exactly the sort of bad science - p-hacking, fishing expeditions, and the garden of forking paths - that led to the replication crisis. (And that makes sense, as this paper is from 2013, and predates the widespread discovery of the crisis.)

The math continues to work out as long as you use the right approach. You have to collect twice as much data, and then set half of it aside at random without examining it. Then you can do whatever perverse p-hacking, multi-modeling, or curve-fitting you like to the half you kept until you reach a hypothesis, then check it against the half you set aside to recover the statistical significance you lost by using techniques that may have overfit the first half. Unsurprisingly, the math works out because this approach is isomorphic to collecting the first half, studying it to form a hypothesis, then conducting a proper pre-hypothesized experiment to collect the second half. Validation via holdout sets is the same approach used in machine learning and elsewhere to prevent models from overfitting data.
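A rough sketch of that split-then-confirm workflow, under loud assumptions: the data is pure Gaussian noise, the "fishing" is just picking the best-correlated of many made-up features, and the p-value comes from a simple permutation test. None of these names or choices come from the paper being discussed.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p_value(xs, ys, rng, num_permutations=500):
    """Share of shuffles whose |correlation| is at least as large as the observed one."""
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (num_permutations + 1)

def split_in_half(num_rows, rng):
    """Randomly assign row indices to an exploration half and a holdout half."""
    rows = list(range(num_rows))
    rng.shuffle(rows)
    half = num_rows // 2
    return rows[:half], rows[half:]

def run_demo(num_rows=200, num_features=40, seed=1):
    rng = random.Random(seed)
    # Pure noise: no feature is truly related to the outcome.
    features = [[rng.gauss(0, 1) for _ in range(num_rows)] for _ in range(num_features)]
    outcome = [rng.gauss(0, 1) for _ in range(num_rows)]
    explore, holdout = split_in_half(num_rows, rng)

    def column(values, rows):
        return [values[i] for i in rows]

    # Fishing expedition: keep whichever feature looks best on the exploration half.
    y_explore = column(outcome, explore)
    best = max(
        range(num_features),
        key=lambda j: abs(pearson(column(features[j], explore), y_explore)),
    )
    # Not an honest p-value: the feature was chosen using this same data.
    p_explore = permutation_p_value(column(features[best], explore), y_explore, rng)
    # Honest check: the holdout half played no part in choosing the feature.
    p_holdout = permutation_p_value(
        column(features[best], holdout), column(outcome, holdout), rng
    )
    return best, p_explore, p_holdout

if __name__ == "__main__":
    best, p_explore, p_holdout = run_demo()
    print(f"feature {best}: p on explored half {p_explore:.3f}, p on holdout {p_holdout:.3f}")
```

The point of the design is that only the holdout p-value keeps its usual meaning, since nothing about the hypothesis was tuned to that half.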
This is true! I was trying to simplify things a bit for a basic explanation, but I fear I oversimplified. I just meant that the generally used math breaks down; if you're aware of the problem, you can correct for it, but very often people don't.
Stating it more plainly, what you wrote was incorrect, and unfairly tarred a statement that was, in fact, correct.
Thanks! For someone who didn't understand why this was considered p-hacking, that made a whole lot of sense.
p<0.05 is also cargo-cult science, and is much more responsible for the replication crisis -- along with biased sampling (pop. 18-22 yo US psych students).

It is also why we see repeated, spurious insistence that anti-depressants don't do anything.

Experiment design is a subtle skill.

You seem to be under the impression that a study like this gives a hard "yes/no" answer as to whether some hypothesis is true. That is not the case, nor is it the case with most studies like these. Instead, you need to do some sort of statistical hypothesis test.

As other comments have pointed out, once you start testing multiple hypotheses on the same dataset, you cannot apply the same significance threshold that you would if you had just begun with a single hypothesis before observing the data. Instead, you need to apply some sort of correction that takes into account the number of hypotheses being tested:

https://en.wikipedia.org/wiki/Family-wise_error_rate#Control...
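As an illustration of what such a correction can look like, here is a minimal sketch of two standard family-wise procedures, Bonferroni and Holm. The p-values in the example are made up, and this is not necessarily the correction the study's authors would or should have used.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only hypotheses whose p-value clears alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ... and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value fails too
    return reject

if __name__ == "__main__":
    p_values = [0.001, 0.012, 0.03, 0.04, 0.2]  # hypothetical results from five tests
    print([p < 0.05 for p in p_values])  # uncorrected: four "discoveries"
    print(bonferroni(p_values))          # [True, False, False, False, False]
    print(holm(p_values))                # [True, True, False, False, False]
```

Holm never rejects fewer hypotheses than Bonferroni at the same alpha, which is why it is often preferred when the plain Bonferroni threshold feels too strict.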

No. If you collect data and then hunt for "significant" results in it you are guaranteed to find spurious results. This is one of the most basic truths of statistics.
You are confusing hypothesis generation with hypothesis testing. Both are science, but only one is a reliable way to determine truth.
Probable claims. Not truth.
In the non-Platonic real world, truth is claims that we believe have high probability.
Not if you want to claim statistical significance. The math behind this method is based on defining the hypothesis before seeing the data (and even then it's usually very weak evidence of a tiny signal within the noise).
xkcd explains it better than I can. Basically, if you run 20 tests at a threshold that gives 95% confidence each, you're probably going to "discover" at least one falsehood (about a 64% chance if the tests are independent).

https://xkcd.com/882/