Hacker News
by Akababa 2385 days ago
What do you mean by coordinate transformation? MLE is invariant under parameter transformations because it's just the argmax of the likelihood.

Indeed, it is the argmax of the likelihood, but the likelihood is not invariant under coordinate transformations. The quantity p(x)dx is invariant, not p(x). By picking a suitable coordinate transformation you can put the MLE on any value where the likelihood is not zero.
MLE is not invariant under parameter transformations because it's just the argmax of the likelihood!

Take for example x~normal and exp(x)~lognormal. The maximum of the distribution is at mu for the former and at exp(mu-sigma^2) for the latter, instead of exp(mu).
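
As a rough numerical check of the normal/lognormal example above, the sketch below grid-searches for the maximum of each density. The values of `MU` and `SIGMA` and the helper `grid_argmax` are illustrative assumptions, not something from the thread.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of x ~ Normal(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def lognormal_pdf(y, mu, sigma):
    """Density of y = exp(x): the normal density at log(y) times the Jacobian 1/y."""
    return normal_pdf(math.log(y), mu, sigma) / y

def grid_argmax(f, lo, hi, steps=100_000):
    """Crude grid search for the maximiser of f on [lo, hi] (hypothetical helper)."""
    best_x, best_val = lo, f(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x

MU, SIGMA = 1.0, 0.5  # illustrative values, not taken from the thread

mode_x = grid_argmax(lambda x: normal_pdf(x, MU, SIGMA), -5.0, 5.0)
mode_y = grid_argmax(lambda y: lognormal_pdf(y, MU, SIGMA), 0.01, 10.0)

print("normal mode:", mode_x, "vs mu:", MU)
print("lognormal mode:", mode_y,
      "vs exp(mu - sigma^2):", math.exp(MU - SIGMA**2),
      "vs exp(mu):", math.exp(MU))
```

The 1/y Jacobian factor in `lognormal_pdf` is what moves the maximum away from exp(mu).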

Adding to the other comments, you still have prior-dependence on a more subtle level, because it depends on what hypotheses are allowed.

Here's an extreme example. Consider flipping an apparently fair coin and getting "THHT". The hypothesis that the coin is fair gives this result with likelihood 1/16. The hypothesis that a worldwide government conspiracy has been formed with the sole purpose of ensuring this result... has a likelihood of 1.

But nobody would ever declare this the MLE, because "government conspiracy" isn't one of the allowed options. But it isn't precisely because it's unlikely, i.e. because of your prior. Of course this is an extreme example, but there are more innocuous prior-based assumptions baked in too.
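
The two likelihoods in this example can be written out directly. This is only a sketch: the function names are made up for illustration, and the "conspiracy" is modelled as a hypothesis that puts probability 1 on one target sequence.

```python
from fractions import Fraction

def fair_coin_likelihood(seq):
    """P(seq | fair coin): each flip independently has probability 1/2."""
    return Fraction(1, 2) ** len(seq)

def rigged_likelihood(seq, target):
    """P(seq | a hypothesis that forces exactly `target`): 1 on a match, else 0."""
    return Fraction(1) if seq == target else Fraction(0)

observed = "THHT"
lik_fair = fair_coin_likelihood(observed)
lik_rigged = rigged_likelihood(observed, target=observed)

print(lik_fair)    # 1/16
print(lik_rigged)  # 1
```

If both hypotheses were allowed, maximising the likelihood would pick the rigged one, which is why the set of admissible hypotheses matters.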

Wait, in frequentist statistics getting, say, a p-value of 1 is not a bad thing--unless you erroneously assume that value is evidence for your null hypothesis.

Consider that if your data generating process really is a fair coin, then the conspiracy outcome you mention only occurs 1 out of 16 times, so 15 out of 16 times you observe a likelihood of 0. 15 out of 16 times you reject the conspiracy case.

There is also a tricky component here, because the notion of sample size is not clearly defined (can we generate multiple 4-tuples of flips, and consider each one a sample? Is your example really just a funky way of discussing type II power?)

> Wait, in frequentist statistics getting, say, a p-value of 1 is not a bad thing--unless you erroneously assume that value is evidence for your null hypothesis.

That's exactly what I'm saying. Suppose you get HHTHT. Then you run the following statistical test:

Hypothesis: a government conspiracy has been hatched to make you get HHTHT.

Null hypothesis: this is not the case.

The p-value is 1/32, so the null hypothesis is rejected.

This is bad reasoning for two reasons: first the alternative hypothesis is incredibly unlikely, and second the choice of alternative hypothesis has been rigged after seeing the data. These are exactly the two reasons so many social science studies running on frequentist stats have done terribly, and why we would benefit from Bayesian stats which force you to make these issues explicit.

> The p-value is 1/32, so the null hypothesis is rejected.

No, the p-value is defined as the likelihood of a result at least as extreme as the one we obtained, under the null hypothesis. It's not simply the likelihood of the particular result you obtained, as that would always be zero for continuous quantities! (Remember that the p-value's distribution is uniform over the 0-1 interval under the null, so any criticism that says the p-value is almost always small just by chance must be wrong somewhere).
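
The uniformity claim can be checked by simulation for a continuous test statistic. The sketch below assumes a one-sided z-test with known sigma as the test; that choice, the sample size, and the seed are illustrative assumptions, not something the comment specifies.

```python
import math
import random

def one_sided_p_value(sample, sigma=1.0):
    """Upper-tail z-test p-value for H0: mean = 0 with known sigma."""
    n = len(sample)
    z = sum(sample) / (sigma * math.sqrt(n))
    # P(Z >= z) for a standard normal Z, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

rng = random.Random(0)  # fixed seed so the run is repeatable
N_EXPERIMENTS, SAMPLE_SIZE = 20_000, 10

# Every experiment draws its data from the null, so the null is true throughout.
p_values = [
    one_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_EXPERIMENTS)
]

# Under a uniform distribution, about a fraction alpha of p-values fall below alpha.
for alpha in (0.05, 0.25, 0.5):
    frac = sum(p <= alpha for p in p_values) / N_EXPERIMENTS
    print(alpha, frac)
```

For discrete data such as a handful of coin flips, the p-value can only take a few values, so uniformity holds only approximately.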

So first you need to establish a way to say what result is how extreme. This is very often trivial and quite objective (the more people cured/made sick, the more extreme the effect of the drug). For the coin flip case, one way would be to call results with more imbalanced ratio more extreme. Then in your 3 heads out of 5 case, the (one sided) p-value would be the likelihood of getting 3, 4 or 5 heads out of 5. You can also come up with a different way to define what "more extreme" means (and put it forward in a convincing way), otherwise you can just not talk about p-values. You can keep talking about likelihoods, but not p-values.
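
For the "more heads is more extreme" ordering described above, the one-sided p-value is a binomial tail sum. A minimal sketch, with `heads_p_value` as a made-up helper name:

```python
from fractions import Fraction
from math import comb

def heads_p_value(n_flips, n_heads):
    """One-sided p-value: P(at least n_heads heads in n_flips fair flips)."""
    tail = sum(comb(n_flips, k) for k in range(n_heads, n_flips + 1))
    return Fraction(tail, 2 ** n_flips)

print(heads_p_value(5, 3))  # 1/2
print(heads_p_value(5, 5))  # 1/32
```

Under this ordering, 3 heads out of 5 is unremarkable: (10 + 5 + 1) / 32 = 1/2.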

> No, the p-value is defined as the likelihood of a result at least as extreme as the one we obtained, under the null hypothesis.

Define for me in an objective way what "at least as extreme" is. Let's say I think the string "HHTHT" is extremely indicative of conspiracy. Then the p-value is 1/32 on the measure of "strings of coin flips at least this extremely indicative of conspiracy".

See, this sounds completely ridiculous, but it's not in principle any different from what is done in thousands of social science papers a year. All these supposedly objective procedures have tons of ambiguity. For example:

> For the coin flip case, one way would be to call results with more imbalanced ratio more extreme.

Why an imbalanced total ratio? Why not average length of heads? Average number of occurrences of "HT"? Frequency of alternations between H and T? Average fraction of times H appears counting only even tosses? Given the combinatorial explosion of possible criteria, I guarantee you I can find a simple-sounding criterion on which any desired string of fair tosses gets a low p-value.
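
To make the dependence on the chosen criterion concrete, the sketch below computes an exact upper-tail p-value for the same string under three different statistics by enumerating all 32 fair sequences. The three statistics are picked for illustration only; it does not attempt to demonstrate the stronger claim that a low p-value can always be found.

```python
from fractions import Fraction
from itertools import product

def count_heads(seq):
    """Total number of heads."""
    return seq.count("H")

def count_alternations(seq):
    """Number of adjacent positions where the face changes."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

def longest_heads_run(seq):
    """Length of the longest unbroken run of heads."""
    return max(len(run) for run in seq.split("T"))

def upper_tail_p_value(statistic, observed):
    """P(statistic(S) >= statistic(observed)) for S uniform over all fair sequences."""
    all_seqs = ["".join(s) for s in product("HT", repeat=len(observed))]
    t_obs = statistic(observed)
    hits = sum(statistic(s) >= t_obs for s in all_seqs)
    return Fraction(hits, len(all_seqs))

observed = "HHTHT"
for stat in (count_heads, count_alternations, longest_heads_run):
    print(stat.__name__, upper_tail_p_value(stat, observed))
```

The same five flips give 1/2, 5/16 and 19/32 depending on which statistic defines "extreme".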

> Why an imbalanced total ratio? Why not average length of heads? Average number of occurrences of "HT"? Frequency of alternations between H and T? Average fraction of times H appears counting only even tosses? Given the combinatorial explosion of possible criteria, I guarantee you I can find a simple-sounding criterion on which any desired string of fair tosses gets a low p-value.

Sure you can p-hack and people definitely do it. Still, good papers argue for any unconventional choice of what they mean by extreme.

> Let's say I think the string "HHTHT" is extremely indicative of conspiracy.

Then I as your peer-reviewer will say I require more justification for your premise. Usually what counts as more extreme is not up to each paper to define, but depends on the conventions of a field that were agreed upon by domain-level reasoning, so you don't always have so many degrees of freedom left (but still have some, that's why p-hacking is a hot topic.)

Again, you're arguing against p-hacking: coming up with your criterion for what counts as extreme after looking at your observation.

Indeed if we assume no p-hacking, things look much nicer. If for some reason you've for years argued on YouTube that there's a conspiracy to make the 5 coin tosses that person X will perform on live TV on such-and-such a date come out as HHTHT, and then it actually does end up being HHTHT on live TV, then I think it's fair to say we can reject the null hypothesis at the level of p=1/32. It doesn't mean we absolutely for eternity have rejected it, but I guess it's worth accepting a paper about your analysis and discussion (taking the analogy back to science). We're already accepting a 5% false positive rate anyway.

>Define for me in an objective way what "at least as extreme" is.

Come up with some one dimensional test statistic T whose distribution D you know under your null hypothesis. Define a one sided p value for data x as P(T <= T(x)), computed under D.

It sounds like your statistic is 0 if the sequence is always "HHTHT" and 1 otherwise? In this case your p value is 1 unless every attempt is "HHTHT". Under the null, the test statistic is 0 with probability 1/32^k for k attempts, so that is your p value when every attempt matches. The more attempts you do, the smaller p gets if the null is false. It's working as intended. For this test, a threshold of p=0.05 would be dumb, but it's always dumb.
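
The indicator statistic described here can be checked by brute-force enumeration for small k. This is a sketch under the assumption that the target sequence was fixed in advance; the function names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

TARGET = "HHTHT"  # the sequence named in the thread, assumed fixed before flipping

def indicator_statistic(attempts):
    """0 if every attempt equals TARGET, 1 otherwise."""
    return 0 if all(a == TARGET for a in attempts) else 1

def lower_tail_p_value(observed_attempts):
    """P(T <= t_obs) under the fair-coin null, by exhaustive enumeration."""
    k = len(observed_attempts)
    t_obs = indicator_statistic(observed_attempts)
    seqs = ["".join(s) for s in product("HT", repeat=len(TARGET))]
    outcomes = list(product(seqs, repeat=k))  # 32**k equally likely outcomes
    hits = sum(indicator_statistic(o) <= t_obs for o in outcomes)
    return Fraction(hits, len(outcomes))

print(lower_tail_p_value(["HHTHT"]))           # 1/32
print(lower_tail_p_value(["HHTHT", "HHTHT"]))  # 1/1024
print(lower_tail_p_value(["HHTHT", "THHTH"]))  # 1
```

Enumeration grows as 32**k, so this is only practical for one or two attempts; the closed form 1/32^k covers the rest.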

It's not an awful test assuming you came up with your test statistic and "HHTHT" before collecting your data. It meshes with the intuition of betting your friend "Hey I bet if you flip this coin you'll get HHTHT." If they proceed to flip it and see HHTHT, they are reasonable to think maybe you know something they don't.

If you come up with your test statistic after the fact, there's theory around p hacking to formalize the intuition of why it's not convincing to watch your friend flip some sequence of coins and then tell them "dude, I totally knew it was going to be that" after the fact.

It's a strawman to always posit frequentists as unthinking blobs of meat who don't consider the credibility of the alternate hypothesis. In fact, many experimental scientists, physicists, biologists etc. made discoveries using frequentist techniques that didn't rely on bogeyman notions of "want to bet the sun just burned out because you're in a closet" nonsense.
I'm a physicist who uses frequentist statistics, and it works fine. However, it can't be denied that some fields misuse it, through precisely the failure modes I pointed out.
What? Can you put in probabilistic terms what "this is not the case" is?

There are an infinite number of models where p(HHTHT | model) != 1, or where p(HHTHT | model) = 0. We need to know which one you're referring to, in order to calculate a p-value.

I think you have made a serious error by believing you can simply "reverse" the model p(HHTHT | conspiracy model) = 1, p(everything else | conspiracy model) = 0.

If the null hypothesis is a fair flip, then the alternative can't be a conspiracy, because the null and alternative need to be complementary statements. So if the null is fair flip, then the alternative is "not fair flip".

edit: whoops, changed mutually exclusive to complementary. see http://www.its.caltech.edu/~mshum/stats/lect8.pdf

The exact point I am making is that all of this is totally up to the researcher. This is the standard methodology in social science: yes, in theory a low p-value does nothing but support the complement of a fairly bland null hypothesis. But in reality that's not what people do. Instead any low p-value is taken as proof of an extremely specific alternative hypothesis.
>Null hypothesis: this is not the case.

>The p-value is 1/32, so the null hypothesis is rejected.

This is incomplete. You need to define a test statistic and know its distribution under your null hypothesis before you can come up with a p value. What's your test statistic here and how is it distributed?

If you define your test after seeing the data, of course you can come up with an arbitrary p value. Choosing a distribution for your null to make it fit an agenda is just like choosing a distribution for your prior after seeing your data to make it fit an agenda.

You could say your prior is a delta function around HHTHT after observing it and get arbitrary evidence, but anyone reading your paper will find it unconvincing, just like anyone reading about a test statistic like this will find it unconvincing.

Your mistake here is in saying that because the p-value is 1/32 you reject the null hypothesis. You just decided to do that with utterly no justification. There is a problem with people unthinkingly deciding that a p-value of .05 is reasonable in most situations, but that is not actually an issue with frequentist statistics any more than people starting out with bizarre priors would be a problem with Bayesian statistics.
"that is not actually an issue with frequentist statistics"

To me that sounds exactly like when people say everything that goes wrong with cryptocurrency in practice is not a problem with the concepts.

Not sure I follow? The hypothesis that the result you see is the result of a worldwide government conspiracy is 100% supported by every result that you see. Because it is 100% consistent with the data, a statistical analysis will tell you exactly that--that it is 100% consistent with the data.
Again: Priors can be and are used to mislead. Both methods can be and are used to mislead. Just moving to Bayes doesn't make the finding free of bias all of a sudden.
It doesn't. But the workflow of Bayes forces you to be explicit. If you try and cook the books, it will be shown for the world to see. Can you provide a paper that quoted a p value for a regression and also validated that all the asymptotic conditions are close enough to being true for that p value to be even somewhat reliable?