Hacker News
by bonoboTP 2385 days ago
> The p-value is 1/32, so the null hypothesis is rejected.

No, the p-value is defined as the likelihood of a result at least as extreme as the one we obtained, under the null hypothesis. It's not simply the likelihood of the particular result you obtained, as that would always be zero for continuous quantities! (Remember that for a continuous test statistic the p-value is uniformly distributed over the 0-1 interval under the null, so any criticism that says the p-value is almost always small just by chance must be wrong somewhere.)
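The uniformity claim can be checked with a small simulation. This is only a sketch: the one-sided z-test on normal data, the sample size, the seed and the function names are assumptions chosen for illustration, not anything from the thread. If the p-value is uniform under the null, the fraction of p-values at or below any alpha should come out close to alpha.

```python
import random
from statistics import NormalDist

def one_sided_p(sample_mean, n):
    # Assumed null: each observation is N(0, 1), so the mean of n draws is N(0, 1/n).
    z = sample_mean * n ** 0.5
    # Upper-tail probability: chance of a z-score at least this large under the null.
    return 1.0 - NormalDist().cdf(z)

def simulate_p_values(trials=20000, n=10, seed=0):
    # Draw data from the null itself and record the p-value of each experiment.
    rng = random.Random(seed)
    ps = []
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        ps.append(one_sided_p(mean, n))
    return ps

ps = simulate_p_values()
for alpha in (0.05, 0.25, 0.5):
    frac = sum(p <= alpha for p in ps) / len(ps)
    print(alpha, round(frac, 3))  # each fraction should land near alpha
```

With a discrete statistic such as a head count the p-value can only take a few values, so the uniformity holds only approximately there.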

So first you need to establish a way to say how extreme each result is. This is very often trivial and quite objective (the more people cured/made sick, the more extreme the effect of the drug). For the coin flip case, one way would be to call results with more imbalanced ratio more extreme. Then in your 3 heads out of 5 case, the (one sided) p-value would be the likelihood of getting 3, 4 or 5 heads out of 5. You can also come up with a different way to define what "more extreme" means (and put it forward in a convincing way), but without some such definition you simply cannot talk about p-values. You can keep talking about likelihoods, but not p-values.
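For the 3-heads-out-of-5 case the arithmetic is short enough to write out exactly. A minimal sketch, assuming a fair coin under the null (the function names are made up for illustration):

```python
from fractions import Fraction
from math import comb

def prob_heads_at_least(k, n):
    """P(at least k heads in n fair tosses), as an exact fraction."""
    return Fraction(sum(comb(n, i) for i in range(k, n + 1)), 2 ** n)

def prob_imbalance_at_least(k, n):
    """Two-sided version: P(|heads - tails| at least as large as for k heads)."""
    observed = abs(2 * k - n)
    hits = sum(comb(n, i) for i in range(n + 1) if abs(2 * i - n) >= observed)
    return Fraction(hits, 2 ** n)

print(Fraction(1, 2 ** 5))            # one specific sequence: 1/32
print(prob_heads_at_least(3, 5))      # one-sided, 3, 4 or 5 heads: 1/2
print(prob_imbalance_at_least(3, 5))  # two-sided: 1
```

The two-sided value is 1 because 3 versus 2 is the smallest imbalance five tosses can produce, so every outcome is at least that extreme.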

> No, the p-value is defined as the likelihood of a result at least as extreme as the one we obtained, under the null hypothesis.

Define for me in an objective way what "at least as extreme" is. Let's say I think the string "HHTHT" is extremely indicative of conspiracy. Then the p-value is 1/32 on the measure of "strings of coin flips at least this extremely indicative of conspiracy".

See, this sounds completely ridiculous, but it's not in principle any different from what is done in thousands of social science papers a year. All these supposedly objective procedures have tons of ambiguity. For example:

> For the coin flip case, one way would be to call results with more imbalanced ratio more extreme.

Why an imbalanced total ratio? Why not average length of heads? Average number of occurrences of "HT"? Frequency of alternations between H and T? Average fraction of times H appears counting only even tosses? Given the combinatorial explosion of possible criteria, I guarantee you I can find a simple-sounding criterion on which any desired string of fair tosses gets a low p-value.
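To see how much the choice of criterion matters, here is a sketch that enumerates all 32 sequences and computes the exact p-value of "HHTHT" under a few candidate statistics. The particular statistics and the "larger is more extreme" convention are assumptions picked for illustration.

```python
from fractions import Fraction
from itertools import product

def p_value(observed, statistic):
    """Exact P(statistic >= observed statistic) over all fair-coin sequences."""
    outcomes = ["".join(s) for s in product("HT", repeat=len(observed))]
    t_obs = statistic(observed)
    hits = sum(statistic(s) >= t_obs for s in outcomes)
    return Fraction(hits, len(outcomes))

# Hypothetical criteria for what counts as "more extreme".
STATISTICS = {
    "heads": lambda s: s.count("H"),
    "alternations": lambda s: sum(a != b for a, b in zip(s, s[1:])),
    "HT count": lambda s: s.count("HT"),
    "is HHTHT": lambda s: int(s == "HHTHT"),
}

for name, statistic in STATISTICS.items():
    print(name, p_value("HHTHT", statistic))
# heads 1/2, alternations 5/16, HT count 3/16, is HHTHT 1/32
```

Same data, same null, and the p-value ranges from 1/2 down to 1/32 depending only on the criterion.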

> Why an imbalanced total ratio? Why not average length of heads? Average number of occurrences of "HT"? Frequency of alternations between H and T? Average fraction of times H appears counting only even tosses? Given the combinatorial explosion of possible criteria, I guarantee you I can find a simple-sounding criterion on which any desired string of fair tosses gets a low p-value.

Sure you can p-hack and people definitely do it. Still, good papers argue for any unconventional choice of what they mean by extreme.

> Let's say I think the string "HHTHT" is extremely indicative of conspiracy.

Then I as your peer-reviewer will say I require more justification for your premise. Usually what counts as more extreme is not up to each paper to define, but depends on the conventions of a field that were agreed upon by domain-level reasoning, so you don't always have so many degrees of freedom left (but still have some, that's why p-hacking is a hot topic.)

Again, you're arguing against p-hacking: coming up with your criterion for what counts as extreme after looking at your observation.

Indeed if we assume no p-hacking, things look much nicer. If for some reason you've for years argued on YouTube that there's a conspiracy to make the 5 coin tosses that person X will perform on live TV on such-and-such a date come out as HHTHT, and then it actually does end up being HHTHT on live TV, then I think it's fair to say we can reject the null hypothesis at the level of p=1/32. It doesn't mean we absolutely for eternity have rejected it, but I guess it's worth accepting a paper about your analysis and discussion (taking the analogy back to science). We're already accepting a 5% false positive rate anyway.

>Define for me in an objective way what "at least as extreme" is.

Come up with some one-dimensional test statistic T whose distribution D you know under your null hypothesis. Define a one-sided p-value for data x as P(T <= T(x)), i.e. the probability under D of a statistic value no larger than the one you observed.

It sounds like your statistic is 0 if every attempt is "HHTHT" and 1 otherwise? In this case your p-value is 1 unless every attempt is "HHTHT", in which case it's 1/32^k for k attempts, since that is the probability under the null that the statistic is 0. The more attempts you do, the smaller p gets if the null is false. It's working as intended. For this test, a threshold of p=0.05 would be dumb, but it's always dumb.
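A sketch of that indicator statistic, using the lower-tail convention P(T <= t) and assuming a fair coin under the null (names are hypothetical). Under the null the statistic is 0 with probability 1/32^k, which is the p-value when every attempt matches:

```python
from fractions import Fraction

TARGET = "HHTHT"

def statistic(attempts):
    # 0 if every attempt reproduced the target sequence, 1 otherwise.
    return 0 if all(a == TARGET for a in attempts) else 1

def p_value(attempts):
    # P(T <= observed T) under a fair coin: P(T = 0) = (1/32)^k, P(T <= 1) = 1.
    k = len(attempts)
    p_zero = Fraction(1, 2 ** len(TARGET)) ** k
    return p_zero if statistic(attempts) == 0 else Fraction(1)

print(p_value(["HHTHT"]))           # 1/32
print(p_value(["HHTHT", "HHTHT"]))  # 1/1024
print(p_value(["HHTHT", "HTHTH"]))  # 1
```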

It's not an awful test assuming you came up with your test statistic and "HHTHT" before collecting your data. It meshes with the intuition of betting your friend "Hey I bet if you flip this coin you'll get HHTHT." If they proceed to flip it and see HHTHT, they are reasonable to think maybe you know something they don't.

If you come up with your test statistic after the fact, there's theory around p-hacking that formalizes the intuition of why it's not convincing to watch your friend flip some sequence of coins and then tell them "dude, I totally knew it was going to be that."
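One way to put a number on the after-the-fact problem: compute the exact false positive rate when the analyst looks at the tosses first and then reports whichever statistic gives the smallest p-value. This is a sketch under assumed choices (10 fair tosses, three hypothetical statistics, a 0.05 threshold), not a description of any formal p-hacking theory.

```python
from fractions import Fraction
from itertools import groupby, product

N = 10
ALPHA = Fraction(1, 20)
OUTCOMES = ["".join(s) for s in product("HT", repeat=N)]

def longest_head_run(s):
    return max((len(list(g)) for ch, g in groupby(s) if ch == "H"), default=0)

# Hypothetical candidate statistics; larger values count as more extreme.
STATS = {
    "heads": lambda s: s.count("H"),
    "alternations": lambda s: sum(a != b for a, b in zip(s, s[1:])),
    "longest head run": longest_head_run,
}

def upper_tail_table(statistic):
    # Map each attainable value v to P(statistic >= v) under a fair coin.
    values = [statistic(s) for s in OUTCOMES]
    return {v: Fraction(sum(u >= v for u in values), len(values)) for v in set(values)}

TABLES = {name: upper_tail_table(f) for name, f in STATS.items()}

def p_value(name, s):
    return TABLES[name][STATS[name](s)]

# False positive rate of each test when it is fixed in advance.
individual = {
    name: Fraction(sum(p_value(name, s) <= ALPHA for s in OUTCOMES), len(OUTCOMES))
    for name in STATS
}
# False positive rate when the best-looking statistic is chosen after the fact.
hacked = Fraction(
    sum(min(p_value(name, s) for name in STATS) <= ALPHA for s in OUTCOMES),
    len(OUTCOMES),
)

for name, rate in individual.items():
    print(name, float(rate))
print("best of three, chosen afterwards", float(hacked))
```

Each pre-specified test stays at or below 5%, while picking the winner afterwards pushes the rate above it, and that is with only three candidate criteria.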

A more general method is to use the likelihood ratio, i.e. the ratio of the likelihood of an outcome under the alternative hypothesis to its likelihood under the null hypothesis. And then pick the outcomes for which this ratio is highest as the ones which will cause you to reject the null hypothesis. Equivalently, the p-value is the probability under the null hypothesis that the likelihood ratio would be at least this large.

This works in the discrete case too, and gives p=1/32 in the original coin flip case.
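A sketch of the likelihood ratio ordering for five tosses. The thread doesn't spell out the alternative, so both alternatives below are assumptions: a "conspiracy" that produces HHTHT with probability 0.9, and a coin biased to 0.7 heads. The first reproduces p=1/32 for HHTHT, the second gives a different answer for the same data.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = ["".join(s) for s in product("HT", repeat=5)]
NULL = {s: Fraction(1, 32) for s in OUTCOMES}  # fair coin

def conspiracy_alternative(target="HHTHT", q=Fraction(9, 10)):
    # Assumed alternative: the target appears with probability q,
    # the remaining mass is spread evenly over the other sequences.
    rest = (1 - q) / (len(OUTCOMES) - 1)
    return {s: (q if s == target else rest) for s in OUTCOMES}

def biased_coin_alternative(p_heads=Fraction(7, 10)):
    # Assumed alternative: independent tosses with P(heads) = p_heads.
    return {s: p_heads ** s.count("H") * (1 - p_heads) ** s.count("T") for s in OUTCOMES}

def lr_p_value(observed, alternative):
    # P under the null that the likelihood ratio is at least as large as observed.
    ratio = {s: alternative[s] / NULL[s] for s in OUTCOMES}
    return sum(NULL[s] for s in OUTCOMES if ratio[s] >= ratio[observed])

print(lr_p_value("HHTHT", conspiracy_alternative()))   # 1/32
print(lr_p_value("HHTHT", biased_coin_alternative()))  # 1/2
```

So the likelihood ratio removes the ambiguity about "more extreme" only once the alternative is pinned down.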

Is the likelihood ratio test more general? I thought that one of the benefits of the usual NHST framework was that you only need the distribution of your stat under the null. With LRT don't you need the distribution under both the null and the alternative? How do you frame a null of mu = 0 against an alternative of mu != 0 with x ~ D_mu in this way?

You don't necessarily need the distribution under the alternative to determine the values for which the likelihood ratio will be highest. In your example, the tails will be the areas of maximum likelihood for any (symmetric) alternative.

Huh, TIL. Thanks :)
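The point about the tails can be illustrated for one concrete family. This sketch assumes x ~ N(mu, 1) with known unit variance and uses the generalized likelihood ratio (the likelihood maximized over mu, divided by the likelihood at mu = 0), which is one common way to handle a composite alternative, not necessarily what the reply had in mind. The ratio works out to exp(n * xbar^2 / 2), so it depends on the data only through |xbar| and the rejection region is the two tails.

```python
from math import exp
from statistics import NormalDist

def generalized_lr(xbar, n):
    # sup over mu of L(mu) / L(0) for n draws from N(mu, 1); the sup is at mu = xbar.
    return exp(n * xbar ** 2 / 2)

def two_sided_p(xbar, n):
    # P under mu = 0 that the ratio is at least as large as observed,
    # which is P(|sample mean| >= |xbar|) because the ratio increases with |xbar|.
    z = abs(xbar) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(z))

n = 25
for xbar in (-0.5, -0.2, 0.0, 0.2, 0.5):
    print(xbar, round(generalized_lr(xbar, n), 3), round(two_sided_p(xbar, n), 4))
```

Only the null distribution of the sample mean was needed to get the p-value; the alternative entered only through the ordering of outcomes.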