Hacker News
by tedsanders 3338 days ago
Respectfully, I disagree.

(1) First, you can certainly have confidence in a hypothesis based on a single data set. If you have a data set with 1 million hours of TV watching that shows zero correlation between watching golf and watching Judge Judy, it's fine to suspect there's little correlation. You don't need to run a second study to have an informed opinion.
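As a rough sketch of why one large data set can justify confidence, the snippet below computes an approximate 95% confidence interval for a Pearson correlation using the Fisher z-transform. It assumes, hypothetically, that each of the 1 million hours is an independent, roughly normal observation, which real viewing logs would not strictly satisfy; `correlation_ci` is an illustrative name, not a library function.

```python
import math

def correlation_ci(r, n, z_crit=1.959963984540054):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform.

    Assumes n independent, roughly bivariate-normal observations (an assumption,
    not something the original comment states about its TV data).
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)            # Fisher transform of the sample correlation
    se = 1.0 / math.sqrt(n - 3)  # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical: treat the 1 million hours as 1,000,000 independent observations.
low, high = correlation_ci(0.0, 1_000_000)
print(f"95% CI for r: [{low:.4f}, {high:.4f}]")  # prints: 95% CI for r: [-0.0020, 0.0020]
```

With an interval that narrow, an observed correlation of zero is already strong evidence that any true correlation is tiny, without a second study.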

(2) Second, collecting new data sets (or, equivalently, blinding yourself to partitions) doesn't fully fix the problem either. If you test lots of hypotheses against your test set, then the odds that some of them pass by chance rise too. Creating third-, fourth-, and fifth-level validation sets just keeps pushing the problem up the ladder. In fact, there's no real difference between the requirement to experimentally validate results and the requirement to have a hypothesis 'work' on both halves of a partitioned dataset. The data doesn't care when you collected it.
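A minimal simulation of that multiple-comparisons effect, assuming every hypothesis is a two-sided z-test at roughly the 5% level on pure noise, so every null is true and every "discovery" is a false positive. The function names and parameters are hypothetical. `family_wise_error` gives the analytic probability of at least one false positive across independent tests: about 64% for 20 tests at α = 0.05.

```python
import math
import random

def false_positive_count(num_hypotheses, n=200, z_crit=1.96, seed=0):
    """Test `num_hypotheses` true-null hypotheses on pure noise; count rejections.

    Each hypothesis is a two-sided z-test that the mean of n standard-normal
    draws is zero. Every null is true, so every rejection is a false positive.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_hypotheses):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)  # variance is known to be 1
        if abs(z) > z_crit:
            hits += 1
    return hits

def family_wise_error(num_hypotheses, alpha=0.05):
    """P(at least one false positive) for independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_hypotheses

for k in (1, 20, 100):
    print(k, false_positive_count(k), round(family_wise_error(k), 3))
```

Holding out a test set does not change this arithmetic: each extra hypothesis tried on the held-out data is one more draw at the same α.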

Ultimately we just have to admit that tests based on randomness are sometimes randomly wrong. There is no silver bullet.


> In fact, there's no real difference between the requirement to experimentally validate results and the requirement to have a hypothesis 'work' on both halves of a partitioned dataset.

This would be correct in the absence of investigator malfeasance. Unfortunately, investigator malfeasance is the problem we're trying to solve, so assuming it away is unwise. The requirement to collect new data imposes pretty strict limits on how many hypotheses you can test. The requirement to find a hypothesis along with a division of your existing data set such that the hypothesis holds in both halves is much more generous: the search for such a pair can be automated just as easily as the search for a hypothesis that works on the unified data set.

Fair, but that's mitigated if you have a rule that fixes an ordering of the data points in advance (say, chronological). Then there should be no difference between two 500-data-point studies and one 1,000-data-point study partitioned into two (uniquely determined) halves.

This is not a solution. It removes one degree of freedom: the ability to draw the "line" dividing one half of the data set from the other. But an evil or naive scientist has limitless other degrees of freedom to choose from, and can make as many comparisons (in the "multiple comparisons" sense) as they like, without you being able to detect it.

After you, the good guy, have specified which half of the data is the playground and which is the confirmatory test set, Evil Scientist can still test as many hypotheses as he likes until he finds one that validates in both halves.
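A sketch of Evil Scientist's loop under a fixed, pre-specified split. It assumes hypothetical pure-noise candidate features and a noise outcome, with the first half of the rows as the playground and the second half as the confirmatory set (as a chronological ordering would give). A candidate "validates" if its correlation with the outcome is significant, with the same sign, in both halves. `dredge`, `pearson`, `fisher_z`, and all parameters are illustrative names, not from any library.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r, n):
    """Approximate standard-normal statistic for H0: true correlation is 0."""
    return math.atanh(r) * math.sqrt(n - 3)

def dredge(num_hypotheses, n=1000, z_crit=1.96, seed=0):
    """Return indices of noise 'hypotheses' that validate in both fixed halves.

    The split is fixed in advance (first half vs. second half), so the
    scientist has no control over the dividing line. Every feature is
    independent noise, so every hit is a false positive.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n)]
    half = n // 2
    hits = []
    for h in range(num_hypotheses):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z1 = fisher_z(pearson(feature[:half], outcome[:half]), half)
        z2 = fisher_z(pearson(feature[half:], outcome[half:]), n - half)
        # "Validates" = significant in both halves, with the same sign.
        if abs(z1) > z_crit and abs(z2) > z_crit and z1 * z2 > 0:
            hits.append(h)
    return hits

hits = dredge(num_hypotheses=4000, n=200)
print(f"{len(hits)} of 4000 pure-noise hypotheses validated in both halves")
```

Under the null, each candidate passes with probability about α²/2 (0.00125 at α = 0.05), so a few thousand tries are expected to yield a handful of "validated" findings, and nothing in a write-up of the winner reveals how many candidates were tried. The small `n=200` only keeps the pure-Python loop fast; the per-test false-positive rate is roughly the same for any reasonable n when the null is true.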

Under the rule "you can only validate a hypothesis by collecting a new data set dedicated to that hypothesis", we, the observers, have a way of guaranteeing that multiple comparisons did not occur. We have no such guarantee under the system you describe.
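The observer's position can be put in rough numbers. Assuming independent two-sided tests at level α, disjoint (hence independent) halves, and a claim that counts only if both halves agree in sign, a single null hypothesis passes the split-half check with probability α²/2. If each dedicated new data set is used for exactly one hypothesis, a published claim is one test; under the split rule, the number k of hidden candidates is invisible to the observer:

```latex
P_{\text{new data}}(\text{false claim}) = \alpha,
\qquad
P_{\text{split}}(\text{at least one false claim} \mid k) = 1 - \left(1 - \tfrac{\alpha^{2}}{2}\right)^{k}
```

With α = 0.05 and k = 1000 independent candidates, the second expression is 1 − 0.99875^1000 ≈ 0.71: squaring α only raises the number of tries needed, and the observer cannot tell whether k was 1 or 1000.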

So to sum up: the rule I describe is not necessary in order to practice good statistics for your own benefit. But it is necessary in order to have a good statistical argument for convincing someone who can't directly perceive the contents of your mind. It's an auditing tool.