by mattjack 3338 days ago

I agree with kharms

>You're describing P-value hacking

Here's an example of what can happen when you take a huge corpus of data and throw an equally huge number of hypotheses at it to see what sticks: https://io9.gizmodo.com/i-fooled-millions-into-thinking-choc...

tl;dr: he "proved" chocolate causes weight loss by comparing chocolate- and non-chocolate-eaters on a very high number of health indicators.

That also introduces the multiple testing problem: https://www.wikiwand.com/en/Multiple_comparisons_problem

The more statistical tests you run against a set of data (EDIT: the more variables you test against a dataset), the higher the chance you get a statistically significant result from random error alone.
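To put a number on that: a minimal sketch, assuming every test is independent and every null hypothesis is actually true (both are simplifying assumptions, and the function name is made up for illustration). Under those assumptions the chance of at least one false positive across m tests at threshold alpha is 1 - (1 - alpha)^m.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


# How quickly the chance of a spurious "hit" grows with the number of tests.
for m in (1, 5, 20, 100):
    print(f"{m:>3} tests: {familywise_error_rate(m):.3f}")
```

With 20 independent tests at the 5% level, the chance of at least one spurious significant result is already about 64%.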


tedsanders 3338 days ago

The solution to false positives is not to artificially rate-limit testing or blind yourself to potentially useful data. It's to understand that 5% is an insufficient significance threshold when your prior belief in a correlation is low.

There are really three solutions to the problem of multiple comparisons: (1) you use a different threshold, (2) you use a different test, and/or (3) you correctly interpret that p=5% does not imply the effect is 95% likely.
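As one illustration of option (1), here is a sketch of two standard threshold adjustments, the Bonferroni correction and the Holm step-down procedure. The comment above does not name a specific method, so these are swapped in as common examples, and the p-values are made up.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni: compare every p-value against alpha divided by the number of tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm_significant(p_values, alpha=0.05):
    """Holm step-down: test p-values from smallest to largest against
    alpha / (tests remaining), stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # every larger p-value fails as well
    return significant


# Hypothetical p-values from five tests on the same data set.
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]
print(bonferroni_significant(p_values))  # [True, False, False, False, False]
print(holm_significant(p_values))        # [True, True, False, False, False]
```

Four of these five results would pass an uncorrected 5% threshold. Both corrections control the chance of any false positive across the whole family of tests; Holm is never less powerful than Bonferroni.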

There's absolutely nothing wrong with exploring a data set, as long as you are responsible in the conclusions you draw.


devrandomguy 3338 days ago

IANAS, but does this mean that a set of raw data loses value, as more information is extracted from it? If I use your old raw data to validate my hypothesis, does that somehow also weaken the statistical evidence for your hypothesis?

I really need to go back and study statistics, this is getting embarrassing.


mattjack 3338 days ago

I worded my comment incorrectly (and edited it accordingly). What I should have said is that when you run a stats test against a dataset, there's a known probability that you'll get a significant correlation simply due to chance. The more variables you examine, the higher that chance becomes.

I just found this on Google but the first page of this paper explains it a little better: http://www.stat.berkeley.edu/~mgoldman/Section0402.pdf


thaumasiotes 3338 days ago

It means that you can't use the same data to confirm a hypothesis as you used to generate the hypothesis. Defensible statistical practice would be to throw anything you like at the original data set, come up with whatever ridiculous idea, and then collect a new data set for the purpose of investigating your ridiculous idea. The original data set provides zero[1] evidence for a hypothesis that it inspired you to think of.

[1] Not really, but this is the cleanest way to sidestep multiple comparisons.


tedsanders 3338 days ago

Respectfully, I disagree.

(1) First, you can certainly have confidence in hypotheses based off single data sets. If you have a dataset with 1 million hours of TV watching that show 0 correlation between watching golf and watching Judge Judy, it's fine to suspect there's little correlation. You don't need to run a second study to have an informed opinion.

(2) Second, collecting new data sets (or equivalently blinding yourself to partitions) doesn't 100% fix the problem either. If you test lots of hypotheses against your test set, then the odds that some of the ones that pass are false positives rise too. Creating third- and fourth- and fifth-level validation sets just keeps pushing the problem up the ladder. In fact, there's no real difference between the requirement to experimentally validate results and the requirement to have a hypothesis 'work' on both halves of a partitioned dataset. The data doesn't care when you collected it.

Ultimately we just have to admit that tests based on randomness are sometimes randomly wrong. There is no perfect silver bullet solution.


thaumasiotes 3338 days ago

> In fact, there's no real difference between the requirement to experimentally validate results and the requirement to have a hypothesis 'work' on both halves of a partitioned dataset.

This would be correct in the absence of investigator malfeasance. Unfortunately, investigator malfeasance is the problem we're trying to solve, so assuming it away is unwise. The requirement to collect new data imposes pretty strict limits on how many hypotheses you can test. The requirement to find a hypothesis along with a division of your existing data set such that the hypothesis holds in both halves is much more generous; it can be automated just as easily as finding a hypothesis that works in the unified data set can.


tedsanders 3338 days ago

Fair, but that's mitigated if you have a rule that requires an ordering of the data points (say, chronologically). Then there should be no difference between two 500-data-point studies and one 1,000-data-point study partitioned in two (uniquely determined) halves.


thaumasiotes 3338 days ago

This is not a solution. It removes one degree of freedom, the ability to draw the "line" dividing one half of the data set from the other. But an evil or naive scientist has limitless other degrees of freedom to choose from, and can make as many comparisons (in the "multiple comparisons" sense) as they like, undetectably to you.

After you, the good guy, have specified which half of the data is the playground and which is the confirmatory test set, Evil Scientist can still run as many hypotheses as he feels like until he finds one that validates in both halves.
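A rough simulation of that scenario, under loudly stated assumptions: each hypothesis is modeled as an independent pure-noise effect (true mean 0, known unit variance), measured once in the playground half and once in the confirmatory half, and it "validates" if a two-sided z-test is significant in both halves with the same sign. The function names and the half size of 50 are made up for illustration; real hypotheses on a shared data set would be correlated.

```python
import math
import random
from statistics import NormalDist, fmean


def z_score(sample):
    """z statistic for 'the true mean is 0', assuming known unit variance."""
    return fmean(sample) * math.sqrt(len(sample))


def count_double_validated(num_hypotheses, half_size=50, alpha=0.05, seed=0):
    """Count pure-noise hypotheses that look significant, with the same sign,
    in both the playground half and the confirmatory half."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha=0.05
    hits = 0
    for _ in range(num_hypotheses):
        # Each hypothesis is measured on fresh noise: the true effect is zero.
        z_play = z_score([rng.gauss(0.0, 1.0) for _ in range(half_size)])
        z_test = z_score([rng.gauss(0.0, 1.0) for _ in range(half_size)])
        same_sign = z_play * z_test > 0
        if abs(z_play) > critical and abs(z_test) > critical and same_sign:
            hits += 1
    return hits


def chance_of_some_double_validation(num_hypotheses, alpha=0.05):
    """Analytic chance that at least one independent null hypothesis passes both halves."""
    per_hypothesis = alpha ** 2 / 2.0  # significant twice, in the same direction
    return 1.0 - (1.0 - per_hypothesis) ** num_hypotheses


if __name__ == "__main__":
    print("spurious double validations:", count_double_validated(2000))
    print("chance of at least one:", round(chance_of_some_double_validation(2000), 3))
```

Under these assumptions a single noise hypothesis passes both halves only about 0.125% of the time, but someone who tries 2,000 of them has roughly a 92% chance of finding at least one that does.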

Under the rule "you can only validate a hypothesis by collecting a new data set dedicated to that hypothesis", we, the observers, have a way of guaranteeing that multiple comparisons did not occur. We have no such guarantee under the system you describe.

So to sum up: the rule I describe is not necessary in order to practice good statistics for your own benefit. But it is necessary in order to have a good statistical argument for convincing someone who can't directly perceive the contents of your mind. It's an auditing tool.


jdmichal 3338 days ago

Obviously the data set doesn't become "weaker" or "lose value" -- it's data, and running stats against it doesn't change it.

However, every test for a correlation against a data set has some chance of yielding a false positive or false negative. This chance is called the p-value, and typically .05, or 5%, is the minimum requirement to be considered "significant". But that means that if you test for 20 or so correlations, you would expect one of them to be wrong. And the only thing that can fix that is reproducing the test with a different data set.

Searching for "science reproduction crisis" will give a lot of good results for further reading.

This topic is also what this XKCD is about -- and it's not a coincidence that there are 20 "test" frames with a .05 p-value:

https://www.xkcd.com/882/


tedsanders 3338 days ago

That is not the definition of a p-value. :(

A p-value of 5% means that, IF the null hypothesis is true (IF!), then there's a 5% chance of getting results at least as extreme as measured.

A p-value of 5% does not mean that you should expect a rate of 5% false positives & negatives.
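A small worked example of the distinction, as a sketch: the share of significant results that are false positives depends on how many tested hypotheses are really true and on the test's power, not on the 5% threshold alone. The prior and power values below are assumptions chosen for illustration, and the function name is made up.

```python
def false_discovery_share(prior_true, power, alpha=0.05):
    """Share of significant results that are false positives, given the fraction
    of tested hypotheses that are really true (prior_true), the chance of
    detecting a real effect (power), and the significance threshold (alpha)."""
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Plausible hypotheses: half of them are true.
print(round(false_discovery_share(prior_true=0.5, power=0.8), 3))   # 0.059
# Long-shot hypotheses: 1 in 100 is true.
print(round(false_discovery_share(prior_true=0.01, power=0.8), 3))  # 0.861
```

With the same 5% threshold, about 6% of significant results are false when half the hypotheses are true, but about 86% are false when only 1 in 100 is. The threshold fixes the error rate given a true null, not the error rate among the results that came out significant.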


jdmichal 3338 days ago

Isn't your second paragraph just a definition of a false positive?

And, it looks like power is the error rate for false negatives:

https://en.m.wikipedia.org/wiki/Statistical_power

Too late to edit my original to fix this.
