by spekcular 1172 days ago
I think if hypothesis testing is understood properly, these objections don't have many teeth.

1. Typically we use p-values to construct confidence intervals, answering the concern about quantifying the effect size. (That is, the confidence interval is the collection of all values not rejected by the hypothesis test.)

2. P-values control type I error. Well-powered designs control type I and type II error. Good control of these errors is a kind of minimal requirement for a statistical procedure. Your example shows that we should perhaps consider more than just these aspects, but we should certainly be suspicious of any procedure that doesn't have good type I and II error control.

3. This is a problem with any kind of statistical modeling, and is not specific to p-values. All statistical techniques make assumptions that generally render them invalid when violated.
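
To make the duality in point 1 concrete, here is a minimal sketch, assuming a one-sample z-test with known standard deviation (an assumption chosen for simplicity, not something stated in the thread). The function names and the grid search are hypothetical illustrations: the interval is built by keeping every candidate mean whose two-sided p-value is at least alpha, and it is then compared with the usual closed-form interval.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, assuming known sigma."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def ci_by_inversion(sample, sigma, alpha=0.05, step=0.001):
    """Keep every grid value of mu0 that the test does NOT reject at level alpha."""
    centre = mean(sample)
    half_range = 10 * sigma / sqrt(len(sample))  # search window, wide enough in practice
    n_steps = int(half_range / step)
    kept = [
        centre + k * step
        for k in range(-n_steps, n_steps + 1)
        if z_test_p_value(sample, centre + k * step, sigma) >= alpha
    ]
    return min(kept), max(kept)


def ci_closed_form(sample, sigma, alpha=0.05):
    """Textbook interval: mean +/- z_crit * sigma / sqrt(n)."""
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(len(sample))
    return mean(sample) - half_width, mean(sample) + half_width


if __name__ == "__main__":
    data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 5.2, 4.7]  # made-up illustrative numbers
    print("inverted test:", ci_by_inversion(data, sigma=0.25))
    print("closed form:  ", ci_closed_form(data, sigma=0.25))
```

The two intervals should agree up to the grid resolution (`step`), which is all the "collection of values not rejected" statement claims.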


Your points are theoretically correct, and probably the reason why many statisticians still regard p-values and NHST favorably.

But looking at the practical application, in particular the replication crisis, specification curve analysis, de facto power of published studies and many more, we see that there is an immense practical problem and p-values are not making it better.

We need to criticize p-values and NHST hard, not because they cannot be used correctly, but because they are not used correctly (and are arguably hard to use right, see the Gigerenzer paper I linked).

The items you listed are certainly problems, but p-values don't have much to do with them, as far as I can see. Poor power is an experimental design problem, not a problem with the analysis technique. Not reporting all analyses is a data censoring problem (this is what I understand "specification curve analysis" to mean, based on some Googling - let me know if I misinterpreted). Again, this can't really be fixed at the analysis stage (at least without strong assumptions on the form of the censoring). The replication crisis is a combination of these two things and other design issues.
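
As a rough illustration of why power belongs to the design stage, the sketch below computes the power of a two-sided one-sample z-test before any data exist. The known-sigma normal model, the function names, and the 80% target are assumptions made for the example, not details from the thread.

```python
from math import sqrt
from statistics import NormalDist


def power(n, effect, sigma=1.0, alpha=0.05):
    """Probability that a two-sided z-test rejects H0: mean == 0
    when the true mean is `effect` (known sigma assumed)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sigma  # mean of the z statistic under the alternative
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


def required_n(effect, sigma=1.0, alpha=0.05, target_power=0.8):
    """Smallest sample size whose power reaches the target (simple linear search)."""
    n = 1
    while power(n, effect, sigma, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    for effect in (0.2, 0.5, 0.8):
        print(f"effect={effect}: n needed for 80% power = {required_n(effect)}")
```

With a true effect of zero, `power` returns alpha, so the same function also shows the type I error rate of the fixed-n test.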
I can understand why you see it this way, but still disagree:

(1) p-values make significance the target, and thus create incentives for underpowered studies, misspecified analyses, early stopping (monitoring significance while collecting data), and p-hacking.

(2) p-values separate crucial pieces of information. A p-value represents a highly specific probability (of data at least as extreme as those observed, given that the null hypothesis is true), but does not include the effect size or a comprehensive estimate of uncertainty. Thus, to be useful, p-values need to be combined with effect sizes and ideally simulations, specification curves, or meta-analyses.

Thus my primary problem with p-values is that they are an incomplete solution that is too easy to use incorrectly. Ultimately, they just don't convey enough information in their single summary. CIs, for example, are just as simple to communicate, but much more informative.
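
The early-stopping incentive in point (1) can be illustrated with a small simulation. This is a sketch under assumed conditions (standard normal data with no true effect, a z-test with known sigma = 1, a look at the data every 10 observations); the function names and defaults are hypothetical. It compares a single test at the planned sample size with "stop as soon as p < alpha at any look".

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(total, n):
    """Two-sided z-test p-value for H0: mean == 0, assuming sigma = 1.
    `total` is the running sum of the n observations."""
    z = total / sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_sims=2000, n_max=100, look_every=10, alpha=0.05, seed=0):
    """Return (fixed-n rate, peeking rate) when the null hypothesis is true."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        total = 0.0
        rejected_at_some_look = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # data generated with no real effect
            if n % look_every == 0 and p_value(total, n) < alpha:
                rejected_at_some_look = True
        fixed_hits += p_value(total, n_max) < alpha  # one test, at the planned n
        peeking_hits += rejected_at_some_look        # significant at any interim look
    return fixed_hits / n_sims, peeking_hits / n_sims


if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print(f"fixed-n false positive rate: {fixed:.3f}")
    print(f"peeking false positive rate: {peeking:.3f}")
```

The fixed-n rate should sit near the nominal alpha, while the peeking rate lands well above it, because each extra look is another chance for noise to cross the threshold.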

I don't understand. CIs are equivalent to computing a bunch of p-values, by test-interval duality. Should I interpret your points as critiques of simple analyses that only test a single point null of no effect (and go no further)? (I would agree that is bad.)
Yes, I argue that individual p-values (as they are used almost exclusively in numerous disciplines) are bad, and that more information on effect size and error is needed. CIs provide it by conveying (1) significance (the interval does not include zero), (2) magnitude of the effect (the center of the CI), and (3) error/noise (the width of the CI). That's significantly better than a single p-value (excuse the pun).
I think part of the problem with p-values and NHST is that they encourage (or don't discourage) underpowered studies. That's because p-hacking benefits from the noise of underpowered studies. If you can test a large number of models and report only the significant one, then an underpowered study with a high type I error rate gives you a greater chance of a significant result.

So I think you are correct that properly powering studies is the crucial thing, but the incentives are against fixing this as long as lone p-values are publishable.

But here the issue is the uncorrected multiple testing and under-reporting of results, not the p-values themselves. Any criterion for judging the presence of an effect is going to suffer from the same issue if researchers don't pre-register and report all of their analyses (since otherwise you have censored data, "researcher degrees of freedom," and so on). This is really a problem with the design and reporting of studies, not the analysis method.
That sounds like saying that if you write C code correctly you won't make memory errors, when in reality it's very common to write incorrect code.
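
A quick way to see the size of the uncorrected multiple-testing problem: the sketch below assumes k independent tests of true null hypotheses, so that each p-value is uniform on (0, 1), and reports only the smallest one. The independence assumption, the function names, and the Bonferroni comparison are choices made for this illustration.

```python
import random


def familywise_rate(k, alpha=0.05):
    """Chance that at least one of k independent true-null tests has p < alpha."""
    return 1 - (1 - alpha) ** k


def simulated_rate(k, alpha=0.05, n_sims=20000, seed=0, bonferroni=False):
    """Simulate 'run k analyses, report the best one' with no real effect anywhere."""
    rng = random.Random(seed)
    threshold = alpha / k if bonferroni else alpha  # Bonferroni divides alpha by k
    hits = 0
    for _ in range(n_sims):
        smallest_p = min(rng.random() for _ in range(k))  # null p-values are uniform
        hits += smallest_p < threshold
    return hits / n_sims


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(
            f"k={k}: analytic={familywise_rate(k):.3f}, "
            f"uncorrected={simulated_rate(k):.3f}, "
            f"bonferroni={simulated_rate(k, bonferroni=True):.3f}"
        )
```

The same inflation would hit any reporting threshold, whether it is a p-value cutoff or an interval excluding zero, which is why the correction has to come from reporting all analyses or adjusting for their number.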

That’s why Rust came along: to stop that behaviour by making that class of mistake impossible. Hence the point: maybe there’s a better test to use as a standard than the p-value.

How else do you propose to construct procedures that control type I error and evaluate their properties?