by pdonis 2383 days ago
> a Bayesian would also say that the knowledge that they planned to have children until they had both a boy and a girl significantly changes the likelihood ratio (or p-value, if you prefer to use that) associated with the observed data.

Actually, on going back and reviewing Jaynes' Probability Theory (section 6.9.1 in particular), I was wrong here: in this particular case, the parents' choice of process does not affect the likelihood ratio for the data. So btilly was correct that Bayesian reasoning gives the same posterior distribution for p (the probability of a given child being a boy) for the two data sets. However, this is not a problem with Bayesian reasoning: it's a problem with frequentist reasoning! In other words, the frequentist argument that the change in the parents' choice of process does affect the inferences we can validly draw from the data, because the p-value changed, is wrong. The Bayesian viewpoint is the one that gives the correct answer.
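A minimal sketch of why the posterior comes out the same, assuming a uniform prior on p and a simple grid approximation (the prior, the grid, and all function names are illustrative choices, not something taken from the thread or from Jaynes). Each process gets its own likelihood function, and both reduce to p^6 (1 - p) for the observed birth order:

```python
# Observed birth order: six boys, then one girl.
DATA = "BBBBBBG"

def seq_prob(seq, p):
    """Probability of one specific birth order when births are i.i.d. with P(boy) = p."""
    prob = 1.0
    for child in seq:
        prob *= p if child == "B" else 1.0 - p
    return prob

def likelihood_fixed_n(seq, p, n=7):
    """Process A: have exactly n children, whatever happens."""
    return seq_prob(seq, p) if len(seq) == n else 0.0

def likelihood_until_both(seq, p):
    """Process B: stop at the first birth that gives at least one boy and one girl."""
    stops_here = len(set(seq)) == 2 and len(set(seq[:-1])) == 1
    return seq_prob(seq, p) if stops_here else 0.0

def grid_posterior(likelihood, grid):
    """Normalized posterior on a grid, assuming a uniform prior for p."""
    weights = [likelihood(DATA, p) for p in grid]
    total = sum(weights)
    return [w / total for w in weights]

# Midpoints of 1000 equal-width cells on (0, 1).
GRID = [(i + 0.5) / 1000 for i in range(1000)]
post_a = grid_posterior(likelihood_fixed_n, GRID)
post_b = grid_posterior(likelihood_until_both, GRID)
mean_a = sum(p * w for p, w in zip(GRID, post_a))
mean_b = sum(p * w for p, w in zip(GRID, post_b))
print(post_a == post_b, mean_a, mean_b)
```

The stopping rule only decides which sequences are possible at all. For a sequence that is possible under both processes, the dependence on p is identical, so any prior that does not depend on the process yields the same posterior.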

Here is an argument for why. The underlying assumption in all of our discussion is that, whatever the value of p is, it is the same for all births: in other words, any given birth is independent of all the others in terms of the chance of the child being a boy. And that assumption, all by itself, is enough to show that the parents' choice of process does not matter as far as inferences from identical outcome data are concerned: it can't matter, because the parents' choice of process does not affect p, i.e., it does not affect the underlying fact that each birth is independent of all the others. And as long as each birth is independent of all the others, the only relevant properties of the data are the total number of children and the number of boys. Nothing else matters. In particular, the p-value does not matter, because computing it requires you to look not just at the relative proportion of boys and girls in the data, but at how "extreme" that proportion is in the overall sample space (the p-value being the probability that a result "at least that extreme" could be obtained by chance).
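To make the contrast concrete, here is a rough sketch of how the p-value shifts with the process even though the data do not. It assumes a null hypothesis of p = 1/2 and one common reading of "at least as extreme" for each design (a one-sided count of boys for the fixed-size family, and family size for the stop-when-both-sexes rule). Those choices and the function names are assumptions for illustration; the thread does not pin them down here.

```python
from math import comb

def p_value_fixed_n(boys=6, n=7, p0=0.5):
    """Process 'have exactly n children': P(at least `boys` boys in n births) under p0."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(boys, n + 1))

def p_value_until_both(n_children=7, p0=0.5):
    """Process 'stop once both sexes appear': P(family size >= n_children) under p0.

    The family reaches n_children only if the first n_children - 1 births
    are all boys or all girls.
    """
    m = n_children - 1
    return p0**m + (1 - p0) ** m

print(p_value_fixed_n())     # 0.0625
print(p_value_until_both())  # 0.03125
```

Same seven births, two different p-values, purely because the set of outcomes counted as "at least as extreme" changed with the parents' intentions.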

Here is another way of looking at it. We are analyzing the same data in two different ways based on two different processes for the parents to decide when they will stop having children. This is equivalent to analyzing two different couples, each of whom uses one of the two processes, and whose data is the same (they both have, in order, six boys and one girl). The claim that the different p-values are relevant is then equivalent to the claim that the data from the two couples is being drawn from different underlying distributions. However, these "distributions" are only meaningful if they correspond to something that is actually relevant to the hypothesis being tested. In this case, that would mean that the couple's intentions regarding how they will decide when to stop having children would have to somehow affect p, since the hypothesis we are testing is a hypothesis about p. But they don't. So the two couples are not part of different distributions in any sense that actually matters for this problem, and hence the different p-values we calculate on the basis of those different distributions should not affect how we weigh the data.

In fact, we can even turn this around. Suppose we decide to test the hypothesis that the parents' choice of process does affect p. How would we do that? Well, we would look at couples who were using different processes, and compare the data they produce, expecting to find variation in the data that correlates to the variation in the process. But in this case, the data is the same for two different choices of process--which means that the data is actually evidence against the hypothesis that the choice of process affects p!
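One way to put a number on that last point is a Bayes factor comparing "both couples share one p" against "each process has its own p". This is only a sketch under assumptions that are not in the original argument: uniform priors on every p, and two couples who each produced the observed six boys and one girl.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact integral of p**a * (1 - p)**b over [0, 1] for non-negative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

BOYS, GIRLS = 6, 1  # each couple's observed birth order has 6 boys and 1 girl

# Hypothesis "same p": both couples share a single p with a uniform prior.
evidence_same = beta_integral(2 * BOYS, 2 * GIRLS)

# Hypothesis "different p": each process has its own p, independent uniform priors.
evidence_diff = beta_integral(BOYS, GIRLS) ** 2

bayes_factor = evidence_same / evidence_diff
print(bayes_factor, float(bayes_factor))
```

Under these assumed priors the ratio is greater than 1, so identical data from the two processes shift the odds toward the shared-p hypothesis, although only modestly with just two families.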

Note that this is not a general claim that other information never matters. It is only a specific claim that, in this particular case, other information doesn't matter. It doesn't matter in this case because of the independence property I described above--the fact that every birth is an independent event with the same value of p, unaffected by the variable that differs between the couples (the choice of process). In hypothetical scenarios where the births were not independent, or where p was not the same for every birth, other information would be relevant; for example, we might want to consider a hypothesis that the age of the parents affected p. A Bayesian would model this by treating p not as a single variable with some assumed prior distribution, but as a function of other variables, which would need to be present in the data (for example, we would have to record the ages of the parents).

How does all this square with the fact that the total sample space certainly does change if the parents' choice of process changes? In the simple case where the process is "have 7 children", every possible outcome is equally likely if p = 1/2, so the probability of any single outcome is just 1 / the total number of outcomes. In the case where the process is "have children until there is at least one of each gender", the outcomes are not all equally likely; the particular outcome that was observed has the same probability as it would under the first process (so btilly is correct about that), but other outcomes have different probabilities. However, as long as each birth is independent, none of those other probabilities affect the inferences we are justified in drawing from the data; only the probability of the actually observed outcome does. (Strictly speaking, as btilly pointed out downthread, it is not the absolute probability that matters but the likelihood ratio; but the likelihood ratio in this case is just the ratio of P(data|p, prior) to P(data|prior), and P(data|prior) is also the same for both data sets since we are assuming the prior for p is independent of the process used to generate the data sets.)
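The two sample spaces can be enumerated directly. The sketch below uses illustrative function names, and truncates the infinite "until both" space at an arbitrary maximum family size. It lists every complete outcome under each process and compares the probability of the observed birth order with that of an outcome that exists in only one of the spaces:

```python
from itertools import product

def outcomes_fixed_n(p, n=7):
    """Process 'have exactly n children': every length-n birth order is possible."""
    return {
        "".join(seq): p ** seq.count("B") * (1 - p) ** seq.count("G")
        for seq in product("BG", repeat=n)
    }

def outcomes_until_both(p, max_len=200):
    """Process 'stop once both sexes appear': a run of one sex, then one child of the other.

    The sample space is infinite, so it is truncated at max_len children.
    """
    space = {}
    for run in range(1, max_len):
        space["B" * run + "G"] = p**run * (1 - p)
        space["G" * run + "B"] = (1 - p) ** run * p
    return space

for p in (0.5, 0.7):
    fixed = outcomes_fixed_n(p)
    until = outcomes_until_both(p)
    # Columns: p, P(BBBBBBG | fixed n), P(BBBBBBG | until both),
    #          P(BG | fixed n), P(BG | until both)
    print(p, fixed["BBBBBBG"], until["BBBBBBG"], fixed.get("BG", 0.0), until["BG"])
```

For any p the observed sequence gets the same probability in both spaces, while an outcome such as "BG" is impossible as a complete family under the first process and quite likely under the second. The "every outcome equally likely" property of the fixed-size space holds only at p = 0.5.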