by radford-neal 1125 days ago
As the author admits at the end, this is rather misleading. In normal usage, "overfit" is by definition a bad thing (it wouldn't be "over" if it was good). And the argument given does nothing to show that Bayesian inference is doing anything bad.

To take a trivial example, suppose you have a uniform(0,1) prior for the probability of a coin landing heads. Integrating over this gives a probability for heads of 1/2. You flip the coin once, and it lands heads. If you integrate over the posterior given this observation, you'll find that the probability of the value in the observation, which is heads, is now 2/3, greater than it was under the prior.

And that's OVERFITTING, according to the definition in the blog post.

Not according to any sensible definition, however.
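The coin numbers above are easy to check. This is a minimal sketch, assuming the standard Beta-Bernoulli update (a uniform(0,1) prior is Beta(1,1)); the function name `predictive_heads` is just an illustrative choice, not something from the blog post.

```python
from fractions import Fraction

def predictive_heads(heads, tails, a=1, b=1):
    """P(next flip is heads) after integrating over a Beta(a, b) prior
    updated with the observed counts. Beta(1, 1) is the uniform(0, 1) prior."""
    return Fraction(a + heads, a + b + heads + tails)

prior = predictive_heads(0, 0)      # no data yet
posterior = predictive_heads(1, 0)  # after observing one head
print(prior, posterior)
```

The probability assigned to the observed outcome rises from the prior value to the posterior value, which is all that the blog post's definition is picking up.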


I was writing another comment based on that same example and his leaving-one-out calculations (at least based on what I understood).

Comparing the posterior with the prior is the extreme case of a leave-one-out procedure: if we leave out the only data point, there is nothing left.

The divergence between the data and the model goes down when we include information about the data in the model. That doesn't seem a controversial opinion. (That's how the blog post is introduced here: https://twitter.com/YulingYao/status/1662284440603619328)

---

If the data consists of two flips they are either equal or different (the former becomes more likely as the true probability diverges from 0.5).

a) If the data is the same, the posterior probability of that result is 3/4. The log score is 2 log(3/4) = -0.6

When we check the out-of-sample log score for each one based on the 2/3 posterior obtained from the other we get in each case a log score log(2/3) = -0.4

b) If the data is different, the posterior probability of each result is still 1/2. The log score is 2 log(1/2) = 2 × (-0.7) = -1.4

When we check the out-of-sample log score for each one based on the 1/3 posterior for getting that result obtained from the other we get in each case a log score log(1/3) = -1.1
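For anyone who wants to reproduce the two-flip numbers, here is a rough sketch under the same uniform-prior assumption. The helper names (`predictive`, `scores`) are made up for illustration, and it sums the leave-one-out scores over both flips rather than reporting them per flip.

```python
import math

def predictive(outcome, heads, tails):
    """Posterior predictive probability of `outcome` ("H" or "T")
    under a uniform prior, given the observed counts."""
    p_heads = (1 + heads) / (2 + heads + tails)
    return p_heads if outcome == "H" else 1 - p_heads

def scores(flips):
    """Return (in-sample log score, leave-one-out log score) for a
    string of flips such as "HH" or "HT"."""
    heads, tails = flips.count("H"), flips.count("T")
    in_sample = sum(math.log(predictive(x, heads, tails)) for x in flips)
    loo = 0.0
    for i, x in enumerate(flips):
        rest = flips[:i] + flips[i + 1:]  # data with flip i held out
        loo += math.log(predictive(x, rest.count("H"), rest.count("T")))
    return in_sample, loo

for flips in ("HH", "HT"):
    in_sample, loo = scores(flips)
    print(flips, in_sample, loo)
```

In both cases the in-sample score comes out higher than the leave-one-out score, which matches the hand calculation.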

When there is only a small amount of information, the variance of any estimate is very large, and this explains what happens in that example. Overfitting implies different behavior in training and in test, and this is related to a large variance in the estimate of the error. So a small amount of information implies that any model suffers from overfitting and large variance; it is a general result, not one related specifically to Bayes.