by AstralStorm 442 days ago

The practice you describe is called data dredging, though. The problem is that you don't know enough of the experimental design details to be sure everything was above board, and this gets worse the older the dataset is.

Normally when doing that you need multiple comparison corrections and conservative statistics. That won't get you published though, or if you do get published you won't get noticed except by someone running a meta-analysis. Perhaps not even then. Usually you end up with negative results from the reanalysis, evidence of tampering, or small effect sizes.
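To make "multiple comparison corrections" concrete, here is a rough sketch of two common ones, Bonferroni and Holm, in plain Python. The p-values are made up for illustration and the function names are my own, not from any library.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a null hypothesis only where p <= alpha / m (m = number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k from 0) to alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # once one test fails, all larger p-values fail too
        reject[i] = True
    return reject


# Hypothetical p-values from five tests run against the same dataset.
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]

uncorrected = [p < 0.05 for p in p_values]
print("uncorrected:", uncorrected)
print("bonferroni: ", bonferroni(p_values))
print("holm:       ", holm(p_values))
```

With these example numbers, four of the five tests pass the naive 0.05 threshold, but only one survives Bonferroni and two survive Holm. That is the "conservative" part: most of what looked like findings disappears.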

Reanalysis also does not reliably detect dataset manipulation, p-hacking by the experimenters, or accidental protocol violations, not necessarily even when the data collection included measures to prevent them.

In short: you cannot 100% trust any dataset you did not make. Not even as part of the team that makes it.

2 comments

nlitened 442 days ago

If you "dredge" any data set (even one you can 100% trust) over and over with random hypotheses until the p-value is < 0.05, you will eventually (actually, pretty quickly) support some false hypothesis. That's why "data dredging" is also p-hacking.
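A quick way to see how fast this happens: with independent tests at a 0.05 threshold, the chance of at least one false positive after k tests is 1 - 0.95^k, which passes 50% at k = 14. The sketch below simulates it on pure noise. The setup (a z-test with a known standard deviation of 1, random half/half splits standing in for "random hypotheses") is my own simplification, not anything from a real study.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sample z-test p-value, assuming a known common standard deviation."""
    se = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(outcome, rng, alpha=0.05, max_hypotheses=1000):
    """Try random, meaningless groupings until one looks 'significant'.

    Returns the number of hypotheses tested, or None if none passed.
    """
    indices = list(range(len(outcome)))
    half = len(indices) // 2
    for attempt in range(1, max_hypotheses + 1):
        rng.shuffle(indices)  # a new arbitrary "hypothesis" about group membership
        group_a = [outcome[i] for i in indices[:half]]
        group_b = [outcome[i] for i in indices[half:]]
        if two_sided_p(group_a, group_b) < alpha:
            return attempt
    return None


rng = random.Random(0)
outcome = [rng.gauss(0, 1) for _ in range(200)]  # pure noise, no real effect
attempts = dredge(outcome, rng)
print("hypotheses tried before a false positive:", attempts)
```

There is no effect in this data by construction, so whatever grouping "wins" is a false positive. The exact count depends on the seed, but it is typically a few dozen at most.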

karma_fountain 442 days ago

Yes, as I understand it there is bias inherent in any dataset due to the fact it is a sample. Data dredging is just looking for that bias. You could do that, but then you'd have to confirm with a new experiment.

TeeMassive 442 days ago

The bias towards positive hypotheses is a consequence of the lack of fundamental discoveries. Most scientific research at this point consists of publicly funded engineering projects with no expected ROI. This is not a bad thing per se, but the culture of research based around making an impression in some noble's court is no longer viable. The incentives need to shift towards good research and good methodology, and they need to be results-agnostic.
