Hacker News
by spekcular 1170 days ago
The items you listed are certainly problems, but p-values don't have much to do with them, as far as I can see. Poor power is an experimental design problem, not a problem with the analysis technique. Not reporting all analyses is a data censoring problem (this is what I understand "specification curve analysis" to mean, based on some Googling - let me know if I misinterpreted). Again, this can't really be fixed at the analysis stage (at least without strong assumptions on the form of the censoring). The replication crisis is a combination of these two things and other design issues.
2 comments

I can understand why you see it this way, but still disagree:

(1) p-values make significance the target, and thus create incentives for underpowered studies, misspecified analyses, early stopping (monitoring significance while collecting data), and p-hacking.

(2) p-values separate crucial pieces of information. A p-value represents a highly specific probability (of data at least as extreme as those observed, given that the null hypothesis is true), but does not include effect size or a comprehensive estimate of uncertainty. Thus, to be useful, p-values need to be combined with effect sizes and ideally simulations, specification curves, or meta-analyses.

Thus my primary problem with p-values is that they are an incomplete solution that is too easy to use incorrectly. Ultimately, they just don't convey enough information in their single summary. CIs, for example, are just as simple to communicate, but much more informative.
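
To make the early-stopping incentive from point (1) concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything from the thread: the true effect is zero, the data are N(0, 1) with known variance, a two-sided z-test is used, and the "peeking" researcher checks significance every 10 observations up to a maximum of 100. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(total, n):
    """Two-sided z-test p-value for H0: mean == 0, assuming known sigma = 1."""
    z = total / sqrt(n)  # sample mean divided by its standard error 1/sqrt(n)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def run_study(rng, max_n=100, peek_every=10, peek=True):
    """Return True if the study ends 'significant' even though the true effect is zero."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        # Assumed behaviour: stop collecting as soon as an interim look is significant.
        if peek and n % peek_every == 0 and p_value(total, n) < 0.05:
            return True
    return p_value(total, max_n) < 0.05

rng = random.Random(0)
n_studies = 4000
fixed_rate = sum(run_study(rng, peek=False) for _ in range(n_studies)) / n_studies
peeking_rate = sum(run_study(rng, peek=True) for _ in range(n_studies)) / n_studies

print(f"false-positive rate, fixed n:      {fixed_rate:.3f}")
print(f"false-positive rate, with peeking: {peeking_rate:.3f}")
```

Under these assumptions the fixed-n rate should land near the nominal 0.05, while the peeking rate comes out well above it. The test itself is the same in both cases; only the stopping rule differs.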

I don't understand. CIs are equivalent to computing a bunch of p-values, by test-interval duality. Should I interpret your points as critiques of simple analyses that only test a single point null of no effect (and go no further)? (I would agree that is bad.)
Yes, I argue that individual p-values (as they are used almost exclusively in numerous disciplines) are bad, and that more information on effect size and error is needed. CIs provide that by conveying (1) significance (the interval does not include zero), (2) magnitude of effect (midpoint of the CI), and (3) error/noise (width of the CI). That's significantly better than a single p-value (excuse the pun).
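
The test-interval duality mentioned in this exchange can be checked numerically. The sketch below is only an illustration under simplifying assumptions: a two-sided z-test with known standard deviation, made-up data, and hypothetical helper names. It shows that a null value mu0 lies inside the 95% CI exactly when the test of H0: mean == mu0 gives p >= 0.05.

```python
from math import sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, assuming sigma is known."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def z_confidence_interval(sample, sigma, alpha=0.05):
    """(1 - alpha) confidence interval for the mean, assuming sigma is known."""
    half_width = STD_NORMAL.inv_cdf(1 - alpha / 2) * sigma / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width

# Made-up data, purely for illustration.
sample = [0.8, 1.9, -0.4, 1.2, 2.3, 0.1, 1.5, 0.9]
sigma = 1.0  # assumed known

low, high = z_confidence_interval(sample, sigma)
print(f"95% CI: ({low:.3f}, {high:.3f})")

# The CI is the set of null values that the test does not reject at alpha = 0.05.
duality_holds = []
for mu0 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    p = z_test_p_value(sample, mu0, sigma)
    inside = low <= mu0 <= high
    duality_holds.append(inside == (p >= 0.05))
    print(f"mu0 = {mu0:.1f}  p = {p:.4f}  inside CI: {inside}")
```

So the interval carries the same accept/reject information as a whole family of p-values, and additionally shows the estimate (its midpoint) and the precision (its width) at a glance.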
I think part of the problem with p-values and NHST is that they encourage (or don't discourage) underpowered studies. That's because p-hacking benefits from the noise of underpowered studies. If you can test a large number of models and only report the significant one, then an underpowered study with a high type I error rate gives you a greater chance of a significant result.

So I think you are correct that properly powering studies is the crucial thing, but the incentives are against fixing this as long as lone p-values are publishable.
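
A rough sketch of how "test many models, report the significant one" inflates false positives, under assumptions that are not from the thread: the true effect is zero in every model, the models are treated as independent (real specifications are usually correlated, which weakens the inflation), each uses a z-test on 20 draws from N(0, 1), and 20 models are tried per study. Function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def null_p_value(rng, n):
    """p-value of a two-sided z-test on n draws from N(0, 1), so H0 is true."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    z = total / sqrt(n)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def reports_significance(rng, n_models, n=20):
    """True if at least one of n_models independent tests reaches p < 0.05."""
    return any(null_p_value(rng, n) < 0.05 for _ in range(n_models))

def false_positive_rate(n_models, n_studies=2000, seed=1):
    """Share of simulated null studies that can report a 'significant' model."""
    rng = random.Random(seed)
    hits = sum(reports_significance(rng, n_models) for _ in range(n_studies))
    return hits / n_studies

single = false_positive_rate(n_models=1)
selective = false_positive_rate(n_models=20)
expected = 1 - 0.95 ** 20  # analytic rate for 20 independent tests, about 0.64

print(f"one pre-specified model:     {single:.3f}")
print(f"best of 20 models, reported: {selective:.3f} (analytic: {expected:.3f})")
```

The same inflation would appear with any single-threshold criterion applied to the best of many unreported analyses, which is the point raised in the reply that follows.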

But here the issue is the uncorrected multiple testing and under-reporting of results, not the p-values themselves. Any criterion for judging the presence of an effect is going to suffer from the same issue, if researchers don't pre-register and report all of their analyses (since otherwise you have censored data, "researcher degrees of freedom," and so on). This is really a problem with the design and reporting of studies, not the analysis method.