Yeah, no thanks though. I don't want every rando adding "priors" that "feel" right to their analysis. Frequentist is straightforward. Both can be (and are) abused to prove bias.
The difference between a frequentist and a Bayesian is that the latter admits that he picks a prior. A frequentist smushes together (1) the statistical assumptions (2) the approximations that make the problem computationally tractable and (3) the mathematical derivations, into one big mess. Just because you're not stating your assumptions doesn't mean there are none. Consider maximum likelihood estimation. It is not invariant under coordinate transformations. So which coordinates you pick is an assumption. In fact, with Bayesian estimation you can do the same thing: picking a prior is equivalent to picking the uniform prior in a different coordinate system. So frequentist estimation does involve picking a prior by picking a coordinate system, even if the frequentist does not admit this.
Frequentist methods are conceptually anything but straightforward. The advantage of frequentist methods is that they are computationally tractable. Usually they are best understood as approximations to Bayesian methods. For instance, MLE can be viewed as the variational approximation to Bayes where the family of probability distributions is the family of point masses, and the prior is uniform.
Indeed, it is the argmax of the likelihood, but the likelihood is not invariant under coordinate transformations. The quantity p(x)dx is invariant, not p(x). By picking a suitable coordinate transformation you can put the MLE on any value where the likelihood is not zero.
MLE is not invariant under parameter transformations because it's just the argmax of the likelihood!
Take for example x~normal and exp(x)~lognormal. The maximum of the distribution is at mu for the former and at exp(mu-sigma^2) for the latter, instead of exp(mu).
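A quick numeric check of the lognormal example, as a sketch only: the values mu = 1.0 and sigma = 0.5 and the helper names are illustrative assumptions, not anything from the comments above. It finds the maximum of the density by grid search in both coordinate systems.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def lognormal_pdf(y, mu, sigma):
    """Density of y = exp(x): the normal density at log(y) times the Jacobian 1/y."""
    return normal_pdf(math.log(y), mu, sigma) / y

def argmax_on_grid(f, lo, hi, steps=100_000):
    """Return the grid point in [lo, hi] where f is largest."""
    xs = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(xs, key=f)

MU, SIGMA = 1.0, 0.5  # assumed values, for illustration only

mode_x = argmax_on_grid(lambda x: normal_pdf(x, MU, SIGMA), -2.0, 4.0)
mode_y = argmax_on_grid(lambda y: lognormal_pdf(y, MU, SIGMA), 0.01, 10.0)

print("mode in x:", mode_x)
print("mode in y = exp(x):", mode_y)
print("exp(mu - sigma^2):", math.exp(MU - SIGMA ** 2))
print("exp(mu):", math.exp(MU))
```

The maximum in the y coordinate lands near exp(mu - sigma^2), not at exp(mode_x), because the Jacobian factor 1/y reshapes the density.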
Adding to the other comments, you still have prior-dependence on a more subtle level, because it depends on what hypotheses are allowed.
Here's an extreme example. Consider flipping an apparently fair coin and getting "THHT". The hypothesis that the coin is fair gives this result with likelihood 1/16. The hypothesis that a worldwide government conspiracy has been formed with the sole purpose of ensuring this result... has a likelihood of 1.
But nobody would ever declare this the MLE, because "government conspiracy" isn't one of the allowed options. But it isn't an allowed option precisely because it's unlikely, i.e. because of your prior. Of course this is an extreme example, but there are more innocuous prior-based assumptions baked in too.
Wait, in frequentist statistics getting, say, a p-value of 1 is not a bad thing--unless you erroneously assume that value is evidence for your null hypothesis.
Consider that if your data generating process really is a fair coin, then the conspiracy outcome you mention only occurs 1 out of 16 times, so 15 out of 16 times you observe a likelihood of 0. 15 out of 16 times you reject the conspiracy case.
There is also a tricky component here, because the notion of sample size is not clearly defined (can we generate multiple 4-tuples of flips, and consider each one a sample? Is your example really just a funky way of discussing type II power?)
> Wait, in frequentist statistics getting, say, a p-value of 1 is not a bad thing--unless you erroneously assume that value is evidence for your null hypothesis.
That's exactly what I'm saying. Suppose you get HHTHT. Then you run the following statistical test:
Hypothesis: a government conspiracy has been hatched to make you get HHTHT.
Null hypothesis: this is not the case.
The p-value is 1/32, so the null hypothesis is rejected.
This is bad reasoning for two reasons: first the alternative hypothesis is incredibly unlikely, and second the choice of alternative hypothesis has been rigged after seeing the data. These are exactly the two reasons so many social science studies running on frequentist stats have done terribly, and why we would benefit from Bayesian stats which force you to make these issues explicit.
> The p-value is 1/32, so the null hypothesis is rejected.
No, the p-value is defined as the likelihood of a result at least as extreme as the one we obtained, under the null hypothesis. It's not simply the likelihood of the particular result you obtained, as that would always be zero for continuous quantities! (Remember that the p-value's distribution is uniform over the 0-1 interval under the null, so any criticism that says the p-value is almost always small just by chance must be wrong somewhere).
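The claim that p-values are uniform under the null can be checked by simulation. This is a rough sketch under assumptions of my own choosing: a two-sided z-test on 10 standard normal draws with known unit variance, an arbitrary seed, and 10,000 repetitions. For a discrete statistic such as a coin-flip count, the distribution is only approximately uniform.

```python
import math
import random

def two_sided_p_value(sample):
    """Two-sided z-test p-value for a true mean of 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility

# Every sample is drawn with the null hypothesis true.
p_values = [
    two_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(10)])
    for _ in range(10_000)
]

share_below_005 = sum(p < 0.05 for p in p_values) / len(p_values)
share_below_050 = sum(p < 0.50 for p in p_values) / len(p_values)
print(share_below_005, share_below_050)
```

If the p-value is uniform, roughly 5% of runs should fall below 0.05 and roughly half below 0.5, up to simulation noise.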
So first you need to establish a way to say what result is how extreme. This is very often trivial and quite objective (the more people cured/made sick, the more extreme the effect of the drug). For the coin flip case, one way would be to call results with more imbalanced ratio more extreme. Then in your 3 heads out of 5 case, the (one sided) p-value would be the likelihood of getting 3, 4 or 5 heads out of 5. You can also come up with a different way to define what "more extreme" means (and put it forward in a convincing way), otherwise you can just not talk about p-values. You can keep talking about likelihoods, but not p-values.
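To make the difference concrete, here is a small sketch comparing the probability of the exact sequence with the one-sided p-value for "3 or more heads out of 5" under a fair coin. The function name is made up for illustration.

```python
from math import comb

def one_sided_p_value(heads, flips):
    """P(at least `heads` heads in `flips` flips of a fair coin)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

p_sequence = 1 / 2 ** 5            # probability of the exact sequence HHTHT
p_value = one_sided_p_value(3, 5)  # probability of 3, 4 or 5 heads out of 5

print(p_sequence)  # 0.03125
print(p_value)     # 0.5
```

With "more heads" as the ordering of extremeness, the p-value is 16/32, nowhere near 1/32.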
It's a strawman to always posit frequentists as unthinking blobs of meat who don't consider the credibility of the alternate hypothesis. In fact, many experimental scientists, physicists, biologists etc. made discoveries using frequentist techniques that didn't rely on bogeyman notions of "want to bet the sun just burned out because you're in a closet" nonsense.
What? Can you put in probabilistic terms what "this is not the case" is?
There are an infinite number of models where p(HHTHT | model) != 1, or where p(HHTHT | model) = 0. We need to know which one you're referring to, in order to calculate a p-value.
I think you have made a serious error by believing you can simply "reverse" the model p(HHTHT | conspiracy model) = 1, p(everything else | conspiracy model) = 0.
If the null hypothesis is a fair flip, then the alternative can't be a conspiracy, because the null and alternative need to be complementary statements. So if the null is fair flip, then the alternative is "not fair flip".
> The p-value is 1/32, so the null hypothesis is rejected.
This is incomplete. You need to define a test statistic and know its distribution under your null hypothesis before you can come up with a p value. What's your test statistic here and how is it distributed?
If you define your test after seeing the data, of course you can come up with an arbitrary p value. Choosing a distribution for your null to make it fit an agenda is just like choosing a distribution for your prior after seeing your data to make it fit an agenda.
You could say your prior is a delta function around HHTHT after observing it and get arbitrary evidence, but anyone reading your paper will find it unconvincing, just like anyone reading about a test statistic like this will find it unconvincing.
Your mistake here is in saying that because the p-value is 1/32 you reject the null hypothesis. You just decided to do that with utterly no justification. There is a problem with people unthinkingly deciding that a p-value threshold of .05 is reasonable in most situations, but that is not actually an issue with frequentist statistics any more than people starting out with bizarre priors would be a problem with Bayesian statistics.
Not sure I follow? The hypothesis that the result you see is the result of a worldwide government conspiracy is 100% supported by every result that you see. Because it is 100% consistent with the data, a statistical analysis will tell you exactly that--that it is 100% consistent with the data.
Again: Priors can be and are used to mislead. Both methods can be and are used to mislead. Just moving to Bayes doesn't mean the finding is free of bias all of a sudden.
It doesn't. But the workflow of Bayes forces you to be explicit. If you try to cook the books, it will be shown for the world to see. Can you provide a paper that quoted a p value for a regression and also validated that all the asymptotic conditions are close enough to being true for that p value to be even somewhat reliable?
If anything, Bayes increases complexity because of the infinite variety of priors that can be chosen. Frequentism is more straightforward because of the removal of this bias. A constant prior along with _actually reading the study/paper_ generally is sufficient. It doesn't preclude future testing. If anything, big discoveries in science require big scrutiny. Bayes does not add anything but complexity and another lever to tune in this regard.
Frequentist has a prior also though. The uniform distribution. In a sense this might be more biased as it doesn't always accurately describe the situation.
Either way I believe the effect of a prior diminishes pretty quickly as you acquire more data. It's only a factor if you have an extremely small set of data.
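A sketch of how the prior washes out, using a Beta-Binomial model. The two priors (a flat Beta(1, 1) and a sceptical Beta(20, 20)) and the 70% heads rate are assumptions picked purely for illustration.

```python
from fractions import Fraction

def beta_posterior_mean(heads, tails, a0, b0):
    """Posterior mean of the heads probability under a Beta(a0, b0) prior."""
    return Fraction(heads + a0, heads + tails + a0 + b0)

gaps = {}
for n in (10, 100, 1000, 10_000):
    heads = 7 * n // 10  # assumed data: 70% heads at every sample size
    tails = n - heads
    flat = beta_posterior_mean(heads, tails, 1, 1)
    sceptical = beta_posterior_mean(heads, tails, 20, 20)
    gaps[n] = float(flat - sceptical)

for n, gap in gaps.items():
    print(f"n = {n:>6}: gap between posterior means = {gap:.5f}")
```

The gap between the two posterior means shrinks roughly in proportion to 1/n, so the choice between these two priors matters at n = 10 and barely registers at n = 10,000. A prior that puts zero mass on the truth would not wash out this way.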
The uniform distribution ("flat") prior lets you interpret a maximum likelihood result as a maximum-a-posteriori (MAP) Bayesian point-estimate (implying a 0-or-1 loss function). One could argue that if you refrain from doing this and just stick to a literal application of the likelihood principle, you're not really depending on a flat prior.
For that matter, what is a "flat" prior over the parameters also depends on what parameterization you're using. Results that are 'intuitive' under one parameterization may not be under a different one.
> Frequentist has a prior also though. The uniform distribution.
No. Experimental design affects frequentist conclusions in a way that is inconsistent with _ANY_ prior.
Here is a real life example. My aunt and uncle had 7 children. 6 boys and one girl. Were they biased towards having one gender over another? If the null hypothesis is that they aren't, the p-value that you get is easily calculated as 16/2^7 = 1/8 = 0.125. (There is 1 arrangement of 7 girls, 7 of 6 girls and a boy, 7 of 6 boys and a girl, and 1 of 7 boys for 16 equally likely arrangements.)
If I add the fact that they planned to have children until they had a boy and a girl, then that changes the p-value. In fact there are only 4 ways that their first 7 children can come out to give evidence this strong against the null hypothesis. So the p-value is now 4/2^7 = 1/32 = 0.03125.
However a Bayesian looks at this and says that no matter what prior you pick, the knowledge that they planned to have children until they had both a boy and a girl does not affect your posterior conclusion. It literally has nowhere to go in the formula and can't make a difference.
Therefore the frequentist's differing conclusions are not consistent with ANY prior, implicit or not.
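The two p-values in this example can be reproduced by brute-force enumeration. This is a sketch assuming p = 1/2 for each birth. The definition of "at least as extreme" under each design is my reading of the comment (at least 6 of one gender for the fixed design, no mixed pair within the first six births for the stopping design).

```python
from itertools import product

# All 2^7 equally likely gender sequences for the first seven births.
births = list(product("BG", repeat=7))

# Design A: exactly 7 children no matter what.
# "At least as extreme" = at least 6 children of the same gender.
extreme_fixed = [s for s in births if max(s.count("B"), s.count("G")) >= 6]

# Design B: keep going until there is at least one boy and one girl.
# "At least as extreme" = the first six births are all the same gender,
# so the family reaches a seventh child.
extreme_stop = [s for s in births if len(set(s[:6])) == 1]

p_fixed = len(extreme_fixed) / len(births)
p_stop = len(extreme_stop) / len(births)

print(len(extreme_fixed), p_fixed)  # 16 0.125
print(len(extreme_stop), p_stop)    # 4 0.03125
```

Same observed data, two different p-values, and the only thing that changed is the stopping rule.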
> a Bayesian looks at this and says that no matter what prior you pick, the knowledge that they planned to have children until they had both a boy and a girl does not affect your posterior conclusion
A Bayesian would say no such thing. A Bayesian would agree that the knowledge that they planned to have children until they had both a boy and a girl doesn't affect your prior: you still are picking how much probability mass you allocate to all of the possible odds of having a boy vs. a girl, and the couple's plans don't affect that.
However, a Bayesian would also say that the knowledge that they planned to have children until they had both a boy and a girl significantly changes the likelihood ratio (or p-value, if you prefer to use that) associated with the observed data. And one of the advantages of Bayesianism is that it forces you to make that explicit as well.
Notice, for example, that when you calculated the first p-value of 1/8, you implicitly assumed that the couple's plan was "have 7 children, no matter what gender each of them is". The sample space is therefore all possible arrangements of 7 children by gender, and the p-value is 1/8, as you say.
But when you calculated the second p-value of 1/32, while you did change the count of arrangements, you failed to recognize that the sample space changed! Now the possibilities are not just all possible arrangements of 7 children (which is what you used), but all possible arrangements of up to 7 children (because the "stop condition" now is not when there are 7 children total, but when there is at least one child of each gender, and that could have happened at a number of children less than 7). So the correct p-value is not 4/2^7, but 4/(2^7 + 2^6 + 2^5 + 2^4 + 2^3 + 2^2) = 4/(2^8 - 2) = 2/127. A Bayesian, who has to calculate the p-value starting from the hypothesis, not the data, would not make that mistake.
And Bayesianism does something else too: it forces you to recognize that the p-value is not actually the answer to the question you were asking! By the p-value criterion, at least with the typical threshold of 0.05, the null hypothesis (that your aunt and uncle are not biased towards having one gender) is rejected. But a Bayesian recognizes that the prior probability of the gender ratio, based on abundant previous evidence, is strongly peaked around 50-50, much more strongly peaked than data with a bias equivalent to a p-value of 2/127 can overcome. So the Bayesian is quite ready to accept that your aunt and uncle had no actual bias towards having boys, they just happened to be one of the statistical outliers that are to be expected given the huge number of humans who have children.
> the sample space changed! Now the possibilities are not just all possible arrangements of 7 children, but all possible arrangements of up to 7 children [...]
> the "stop condition" now is not when there are 7 children total
Your answer makes no sense to me. If you consider the space of possible combinations that can lead to having a boy and a girl, why do you stop at seven children? Why consider five boys and one girl but reject seven boys and one girl? Both of them are end cases that could be reached.
Yes, I was posting in a rush and was being sloppy. Here's a more detailed calculation.
The process involved is that the couple continues to have children until they have at least one of each gender. If we assume that at each birth there is a probability p of having a boy (as I noted in my response to btilly elsewhere, the Bayesian prior would actually be a distribution for p, not a point value, but I'll ignore that here for simplicity), then the process can be modeled as a branching tree something like this:
Child #1:
boy -> p;
girl -> 1 - p
Child #2:
boy - boy -> p^2;
boy - girl -> p(1 - p) : STOP;
girl - boy -> (1 - p)p : STOP;
girl - girl -> (1 - p)^2
So we have a probability of 2p(1 - p) of stopping at child 2.
Child #3:
boy - boy - boy -> p^3;
boy - boy - girl -> p^2(1 - p) : STOP;
girl - girl - boy -> p(1 - p)^2 : STOP;
girl - girl - girl -> (1 - p)^3
So we have a probability of p^2(1 - p) + p(1 - p)^2 = p(1 - p) of stopping at child 3 (no extra factor is needed for not having stopped at child 2, because the path probabilities above already exclude the branches that stopped there).
By a similar process we can carry out the tree for as many children as we want. For the case p = 1/2, which was the case I was considering, all of these expressions for the probability of stopping at child #N (for N > 1) simplify to 1 / 2^(N - 1). So the probability of stopping at or before child #N is the sum of those probabilities from 2 to N; for N = 7 that is 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64 = 63/64. That is close enough to 1 that I ignored cases with more than 7 children; but for a more exact calculation you could add an extra 1/64 to the denominator used to calculate the likelihood (or p-value) of the specific case that was actually observed, to allow for the cases with more than 7 children.
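The stopping probabilities can be tabulated directly. A minimal sketch, assuming each path's probability is the product of its per-birth probabilities: stopping at child n means either n - 1 boys followed by a girl, or n - 1 girls followed by a boy. The function name is hypothetical.

```python
from fractions import Fraction

def stop_probability(n, p):
    """Probability that the first mixed pair of genders is completed at child n (n >= 2)."""
    q = 1 - p
    # (n - 1) boys then a girl, or (n - 1) girls then a boy.
    return p ** (n - 1) * q + q ** (n - 1) * p

p = Fraction(1, 2)
stops = {n: stop_probability(n, p) for n in range(2, 8)}
cumulative = sum(stops.values())

for n, prob in stops.items():
    print(f"stop at child {n}: {prob}")
print("stop at or before child 7:", cumulative)  # 63/64
```

For p = 1/2 each term is 1/2^(n - 1), and the remaining 1/64 is the probability that the first seven children are all the same gender.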
In Bayes' formula, the absolute probability of the observed outcome does not matter. What matters is the ratio of the probability of the observed outcome for a given p to its probability averaged over your prior.
The structure of what might have happened does not affect those ratios. Only what was observed does.
> a Bayesian would also say that the knowledge that they planned to have children until they had both a boy and a girl significantly changes the likelihood ratio (or p-value, if you prefer to use that) associated with the observed data.
Actually, on going back and reviewing Jaynes' Probability Theory (section 6.9.1 in particular), I was wrong here, because in this particular case, the parents' choice of process does not affect the likelihood ratio for the data. So btilly was correct that Bayesian reasoning gives the same posterior distribution for p (the probability of a given child being a boy) for the two data sets. However, in fact, this is not a problem with Bayesian reasoning: it's a problem with frequentist reasoning! In other words, the frequentist argument that the change in the parents' choice of process does affect the inferences we can validly draw from the data, because the p-value changed, is wrong. The Bayesian viewpoint is the one that gives the correct answer.
Here is an argument for why. The underlying assumption in all of our discussion is that, whatever the value of p is, it is the same for all births: in other words, any given birth is independent of all the others in terms of the chance of the child being a boy. And that assumption, all by itself, is enough to show that the parents' choice of process does not matter as far as inferences from identical outcome data are concerned: it can't matter, because the parents' choice of process does not affect p, i.e., it does not affect the underlying fact that each birth is independent of all the others. And as long as each birth is independent of all the others, then the only relevant properties of the data are the total number of children and the number of boys. Nothing else matters. In particular, the p-value, which requires you to look, not just at the relative proportion of boys and girls in the data, but at how "extreme" that proportion is in the overall sample space (since the p-value is the probability that a result "at least that extreme" could be obtained by chance), does not matter.
Here is another way of looking at it. We are analyzing the same data in two different ways based on two different processes for the parents to decide when they will stop having children. This is equivalent to analyzing two different couples, each of whom uses one of the two processes, and whose data is the same (they both have, in order, six boys and one girl). The claim that the different p-values are relevant is then equivalent to the claim that the data from the two couples is being drawn from different underlying distributions. However, these "distributions" are only meaningful if they correspond to something that is actually relevant to the hypothesis being tested. In this case, that would mean that the couple's intentions regarding how they will decide when to stop having children would have to somehow affect p, since the hypothesis we are testing is a hypothesis about p. But they don't. So the two couples are not part of different distributions in any sense that actually matters for this problem, and hence the different p-values we calculate on the basis of those different distributions should not affect how we weigh the data.
In fact, we can even turn this around. Suppose we decide to test the hypothesis that the parents' choice of process does affect p. How would we do that? Well, we would look at couples who were using different processes, and compare the data they produce, expecting to find variation in the data that correlates to the variation in the process. But in this case, the data is the same for two different choices of process--which means that the data is actually evidence against the hypothesis that the choice of process affects p!
Note that this is not a general claim that other information never matters. It is only a specific claim that, in this particular case, other information doesn't matter. It doesn't matter in this case because of the independence property I described above--the fact that every birth is an independent event with the same value of p, unaffected by the variable that differs between the couples (the choice of process). In hypothetical scenarios where the births were not independent, then other information would be relevant; for example, we might want to consider a hypothesis that the age of the parents affected p. A Bayesian would model this by not treating p as a single variable with some assumed prior distribution, but as a function of other variables, which would need to be present in the data (for example, we would have to record the ages of the parents).
How does all this square with the fact that the total sample space certainly does change if the parents' choice of process changes? In the simple case where the process is "have 7 children", every possible outcome is equally likely, so the probability of any single outcome is just 1 / the total number of outcomes. In the case where the process is "have children until there is at least one of each gender", then the outcomes are not all equally likely; the particular outcome that was observed has the same probability as it would under the first process (so btilly is correct about that), but other outcomes have different probabilities. However, as long as each birth is independent, none of those other probabilities affect the inferences we are justified in drawing from the data; only the probability of the actually observed outcome does. (Strictly speaking, as btilly pointed out downthread, it is not the absolute probability that matters but the likelihood ratio; but the likelihood ratio in this case is just the ratio of P(data|p, prior) to P(data|prior), and P(data|prior) is also the same for both data sets since we are assuming the prior for p is independent of the process used to generate the data sets.)
> a Bayesian looks at this and says that no matter what prior you pick, the knowledge that they planned to have children until they had both a boy and a girl does not affect your posterior conclusion
> A Bayesian would say no such thing...
Actually they would if they understood the formula. Bayes' formula has no place to put things that could have been observed had things turned out differently, but which didn't actually happen. Therefore mighta, woulda, coulda but didn't cannot affect your conclusions. Ever.
> However, a Bayesian would also say that the knowledge that they planned to have children until they had both a boy and a girl significantly changes the likelihood ratio (or p-value, if you prefer to use that) associated with the observed data. And one of the advantages of Bayesianism is that it forces you to make that explicit as well.
I am not sure how you think that the calculation should be carried out. But it certainly shouldn't be done the way that you describe.
If your prior was that a fraction p of the children would be boys, the odds of the observed outcome would be p^6 * (1-p). It is that regardless of which version of the experiment you run. The conditional probability of the fraction being around p given the data is the probability in your prior of it being around p, multiplied by p^6 * (1-p), and divided by the a priori probability of the observed outcome, 6 boys and then a girl. The calculation is the same in both versions of the experiment and therefore the conclusion is as well.
> And Bayesianism does something else too: it forces you to recognize that the p-value is not actually the answer to the question you were asking! By the p-value criterion, at least with the typical threshold of 0.05, the null hypothesis (that your aunt and uncle are not biased towards having one gender) is rejected. But a Bayesian recognizes that the prior probability of the gender ratio, based on abundant previous evidence, is strongly peaked around 50-50, much more strongly peaked than data with a bias equivalent to a p-value of 2/127 can overcome. So the Bayesian is quite ready to accept that your aunt and uncle had no actual bias towards having boys, they just happened to be one of the statistical outliers that are to be expected given the huge number of humans who have children.
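Here is a grid-approximation sketch of that calculation. The flat prior over p and the 1000-point grid are assumptions for illustration. The fixed-size design is scored with the binomial probability of 6 boys among 7 children, the stopping design with the probability of the exact sequence of six boys then a girl.

```python
from math import comb

# Candidate values of p (probability of a boy), midpoints of 1000 bins on (0, 1).
GRID = [(i + 0.5) / 1000 for i in range(1000)]

def posterior(likelihood, prior):
    """Normalized posterior weights over GRID."""
    weights = [likelihood(p) * prior(p) for p in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def flat_prior(p):
    return 1.0  # assumed prior, for illustration only

def lik_fixed(p):
    # "Have exactly 7 children": binomial probability of 6 boys out of 7.
    return comb(7, 6) * p ** 6 * (1 - p)

def lik_stop(p):
    # "Stop at the first mixed pair": probability of the sequence BBBBBBG.
    return p ** 6 * (1 - p)

post_fixed = posterior(lik_fixed, flat_prior)
post_stop = posterior(lik_stop, flat_prior)

max_gap = max(abs(a - b) for a, b in zip(post_fixed, post_stop))
post_mean = sum(p * w for p, w in zip(GRID, post_stop))
print(max_gap, post_mean)
```

The two likelihoods differ only by a constant factor that does not depend on p, so it cancels in the normalization and the posteriors coincide up to floating-point rounding. Under the flat prior the posterior mean comes out close to 7/9, the mean of Beta(7, 2).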
Actually a Bayesian with access to actual population data would be aware, as you aren't, that globally we average 1.07 boys to each girl at birth. Therefore most couples, likely including my aunt and uncle, were probably biased towards having boys.
There is a good deal of coincidence involved in my actually having the setup for a classic criticism of frequentism in a close relative. But if it happened, the odds were in favor of it involving 6 boys and a girl rather than the other way around.
> Bayes' formula has no place to put things that could have been observed had things turned out differently, but which didn't actually happen.
Sure it does: you have to calculate the probability of your data given the hypothesis. Doing that requires considering all possible outcomes of the hypothesis and their relative likelihood, not just the one you actually observed.
> If your prior was that a fraction p of the children would be boys, the odds of the observed outcome would be p^6 * (1-p).
The prior would not actually be a single value for p; it would be a distribution for p over the range (0, 1). The distribution I described was a narrowly peaked Gaussian around p = 0.5, though, as you point out, that might not be the correct value for the peak (see below). However, for illustration purposes, it is much easier to talk about the (idealized, unrealistic) case where your prior is in fact a single point value for p.
However, in order to calculate the odds of the observed outcome, as I said above, you don't just need to know the prior for p. You need to know the process by which the outcomes are generated, according to the hypothesis. The odds you give assume that that process is "bear seven children, regardless of their gender". But that is not the correct process for the actual decision procedure you describe your aunt and uncle as using. That process won't necessarily result in seven children, and the odds of the actually observed outcome will change accordingly.
> a Bayesian with access to actual population data would be aware, as you aren't, that globally we average 1.07 boys to each girl at birth
Depends on whose data you look at and over what time period. But I agree that the best prior to use in a given case would be whatever distribution you get from the data you already have, and yes, that might not be peaked exactly at 50-50.
"globally we average 1.07 boys to each girl at birth. Therefore most couples, likely including my aunt and uncle, were probably biased towards having boys."
Let's say you perform a maximum-likelihood estimate. You still have an assumption baked in: that maximizing the likelihood given the data is the right way to make your estimate.
In fact, it's very interesting to reconstruct a Bayesian prior for a maximum likelihood estimate. For example, suppose you estimate the probability of a binary event from 10 head flips and 8 tail flips. The ML estimate of the probability of tails is 8/18 = 4/9. A Beta-Binomial Bayesian model with prior Beta(a0, b0) leads to a posterior distribution of Beta(8 + a0, 10 + b0), with a mean of (8 + a0)/(10 + 8 + a0 + b0). Now you can see that the maximum likelihood estimate is identical in this case to the posterior mean under a Bayesian prior of Beta(0, 0).
I am not saying by this that frequentism is Bayesian inference in disguise, rather, you cannot escape the assumptions.
Also, frequentism is not that straightforward: there are many kinds of frequentist estimators and it can be complicated to choose among them.
The ML estimate is a posterior mode, assuming a flat prior. It's not immediately clear that it will always be possible to find a corresponding posterior mean. (From a Bayesian point of view, this is a difference in loss functions as opposed to priors over the parameters. With a posterior mean, you're making the optimal inference assuming a quadratic loss; a posterior mode is appropriate for a 0-or-1 loss.)
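Both readings of the 10 heads / 8 tails example can be checked with exact fractions. A small sketch with made-up helper names: the ML estimate equals the posterior mean under a Beta(0, 0) prior, and it also equals the posterior mode under the flat Beta(1, 1) prior.

```python
from fractions import Fraction

TAILS, HEADS = 8, 10  # estimating the probability of tails

def posterior_params(a0, b0):
    """Beta posterior parameters after the data, given a Beta(a0, b0) prior."""
    return TAILS + a0, HEADS + b0

def beta_mean(a, b):
    return Fraction(a, a + b)

def beta_mode(a, b):
    # Formula for the mode of Beta(a, b); valid when a > 1 and b > 1.
    return Fraction(a - 1, a + b - 2)

mle = Fraction(TAILS, TAILS + HEADS)
mean_beta00 = beta_mean(*posterior_params(0, 0))  # improper Beta(0, 0) prior
mode_flat = beta_mode(*posterior_params(1, 1))    # flat Beta(1, 1) prior
mean_flat = beta_mean(*posterior_params(1, 1))

print(mle, mean_beta00, mode_flat, mean_flat)  # 4/9 4/9 4/9 9/20
```

So which prior "corresponds" to the ML estimate depends on whether you summarize the posterior by its mean or by its mode, which is the loss-function point made above.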
That's not the point. The point is there is no choice between having priors and not. There is only the choice between acknowledging priors versus doublethink, confusion, and deception.
With how fashionable it is to talk about implicit bias, I wonder how those concerns intersect with the people attacking Bayesian approaches here.