|
This is a huge problem in scientific papers. It's very common to see results for all kinds of metrics with confidence intervals or p-values, and then a few "significant" measurements, with no mention of the fact that multiple tests were made - and implicitly, possibly many more tests were made during the exploratory phase of the research. What does significance even mean then? Hard to say (there are techniques that try to compensate, of course, but they have their own issues). One simple way to at least mitigate the problem is to require far lower p-values (or correspondingly wider CIs), and where that's not feasible, to require a much clearer-eyed explanation and acceptance of the fact that such research cannot be trivially supported by statistics, and instead additionally requires careful experimental setup and consideration of causal networks. Basically: if you have p = 0.0001 or whatever, I'm more willing to believe that publication bias and multiple testing aren't likely to be causing false positives all that often. But without that, you want a clear hypothesis and a proposed way to test it, both published beforehand, and just one test, and ideally a clear hypothesis about causation too, so you can critically push and prod the results to try to distinguish noise from signal. A p = 0.03 just isn't obvious at all.
In general, I think modern science is too reliant on statistics over complex systems. In the effort to tease out significance, researchers then need to correct for all kinds of known interference (confounders) and other effects, and so need more advanced statistical models and less general assumptions about distributions (whether for significance or for mathematical tractability), to the point that it's just very hard for anyone to say they didn't make some systematic error somewhere. And sure, being an expert in the subject matter and having an expert statistician on hand helps, but making reasoning errors is too easy, too human, to reliably avoid.
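To put rough numbers on the multiple-testing point, here is a minimal sketch. It assumes every tested effect is truly null and that the tests are independent, which real studies rarely satisfy exactly, so treat it as an illustration and not a model of any particular paper. The function names are made up for this example.

```python
import random


def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one p < alpha among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positive_rate(alpha: float, num_tests: int,
                                 num_studies: int = 20000, seed: int = 0) -> float:
    """Fraction of simulated null-only studies reporting at least one 'significant' result."""
    rng = random.Random(seed)
    studies_with_hit = 0
    for _ in range(num_studies):
        # Assumption: under a true null hypothesis a p-value is uniform on [0, 1],
        # so each test is modelled as a single uniform draw.
        if any(rng.random() < alpha for _ in range(num_tests)):
            studies_with_hit += 1
    return studies_with_hit / num_studies


if __name__ == "__main__":
    for alpha in (0.05, 0.0001):
        for num_tests in (1, 20, 100):
            rate = familywise_error_rate(alpha, num_tests)
            print(f"alpha={alpha}, tests={num_tests}: P(at least one false positive) = {rate:.4f}")
```

Under these assumptions, 20 tests at p < 0.05 give about a 64% chance of at least one spurious "significant" result, while 100 tests at p < 0.0001 give only about 1%. That's the intuition behind trusting a very low p-value more when the number of tests actually run is unknown.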
Instead of seeking signals in noise, we should be targeting research more narrowly at the parts of the puzzle that we can measure better, then using plain classical logic to put the pieces together - not trying to measure the whole thing in one go. After we put all the well-measured pieces together, validating with tricky statistics is reasonable as a sanity check, but not much more than that. If common sense is hard, statistics is harder, even for statisticians. Interpreting results like this as any more than "huh, that's something we could look into" is unwise. |