by bachmeier 1971 days ago
> Sure, if you choose a bad starting point, your initial samples might not be representative of the overall distribution, but if a handful of non-representative points can massively impact your result, then I'm not sure how stable your result was to begin with (how do you know there isn't some other set of low-probability high-impact points that your sampler just missed through luck?).

You're right, and most comments I've seen over the years on the post conveniently miss that he addresses that:

> This unbiasedness argument is rubbish. If you start at x and I start at x then your MCMC run is no better than mine. If you used burn-in and I didn't, then you are entitled to woof about approximate unbiasedness and I am not. But that woof does not make your estimator any better.

My interpretation has always been this, and I think it's correct: You need a good starting point. There's no reason to think burn-in gives you a good starting point. Instead, use something that's actually intended to give a good starting point, like the mode.


For difficult problems, the mode may (a) be as hard to find as, or harder to find than, doing an MCMC sampling run, and (b) be completely unrepresentative of the overall distribution.
I agree, but his argument is that in general doing a burn-in is still not going to be a substitute for good starting values, and if anything it's even easier to get a bad starting value using burn-in on a difficult problem.
If you have some nice idea of how to find a good starting value, then you should certainly use it, not just rely on burn-in.

But having used your good starting value, you should still discard some burn-in iterations. This is certainly true if you're running more than one chain, since including all the early iterations, every one of them tied to this same starting value, will bias the results (in a real, not just theoretical sense, though the magnitude of the bias will of course vary with your problem). Even if you're running just one chain, you should discard at least some burn-in (say 5%) even if you have no evidence that it is necessary, because you really don't know that your supposed good starting point is actually representative. (That is, you don't know this for difficult problems, which are the ones I'm discussing.)
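
To make the multiple-chain point concrete, here is a rough sketch (my own toy setup, not anything from the post): many short random-walk Metropolis chains targeting a standard normal, all started at the mode x = 0, estimating E[x^2], whose true value is 1. The names metropolis_chain and second_moment, the step size, and the chain counts are all assumptions picked for illustration.

```python
import math
import random

def metropolis_chain(start, n_iter, step, rng):
    """Random-walk Metropolis targeting a standard normal; returns visited states."""
    x = start
    states = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        # Log acceptance ratio for a target density proportional to exp(-x^2 / 2).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        states.append(x)
    return states

def second_moment(chains, burn_in):
    """Pooled estimate of E[x^2], dropping the first burn_in states of each chain."""
    kept = [x * x for chain in chains for x in chain[burn_in:]]
    return sum(kept) / len(kept)

rng = random.Random(0)
N_CHAINS, N_ITER, STEP = 2000, 100, 0.5   # illustrative values only
chains = [metropolis_chain(0.0, N_ITER, STEP, rng) for _ in range(N_CHAINS)]

burn_in = N_ITER // 20                    # the "say 5%" from the comment above
est_all = second_moment(chains, 0)
est_burn = second_moment(chains, burn_in)
print("E[x^2], no burn-in:", est_all)
print("E[x^2], 5% burn-in:", est_burn)
```

Every chain starts at exactly the same point, so the early states are all bunched near 0 and pull the pooled second moment down. The mean estimate would look fine here, which is part of why this kind of bias is easy to miss.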

I don't understand how the mode can be unrepresentative of the overall distribution. It seems like it's one of the finest representatives.
This can happen easily in Bayesian hierarchical models, where there is a hyperparameter that controls the variance of many lower-level parameters. When the variance is small, the probability density for these parameters is high (their distribution is sharply peaked); when the variance is large, the density is smaller (maybe many, many orders of magnitude smaller). So the mode will be where the variance is small, even if the data make this a much less probable region of the parameter space. (Note: the probability of a region is the product of its volume and its probability density, so the total probability can be low even if the density is extremely high.)
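
A toy calculation may help here. This is only a sketch under my own assumptions, not a model from the thread: a funnel-shaped distribution with a log-variance hyperparameter v ~ N(0, 3^2) and nine lower-level parameters theta_i ~ N(0, exp(v)), with no data term at all. The names N, SD_V, log_joint_at_zero and normal_cdf are made up for the example.

```python
import math

N = 9        # number of lower-level parameters (assumed for illustration)
SD_V = 3.0   # std. dev. of the log-variance hyperparameter v (assumed)

def log_joint_at_zero(v):
    """Log joint density at theta = 0 for v ~ N(0, SD_V^2), theta_i ~ N(0, exp(v))."""
    log_p_v = -0.5 * (v / SD_V) ** 2 - math.log(SD_V * math.sqrt(2 * math.pi))
    # Each theta_i = 0 contributes log N(0; 0, exp(v)) = -v/2 - log(2*pi)/2.
    log_p_theta = N * (-0.5 * v - 0.5 * math.log(2 * math.pi))
    return log_p_v + log_p_theta

def normal_cdf(x, sd):
    """P(X <= x) for X ~ N(0, sd^2)."""
    return 0.5 * math.erfc(-x / (sd * math.sqrt(2)))

# Setting d/dv of the log joint to zero: -v / SD_V^2 - N / 2 = 0.
v_mode = -N * SD_V ** 2 / 2

# How much higher the density is at the mode than at v = 0 (both with theta = 0).
log_density_gap = log_joint_at_zero(v_mode) - log_joint_at_zero(0.0)

# Marginally v ~ N(0, SD_V^2), so this is the total probability of the whole
# region within 3 standard deviations above the mode, or anywhere below it.
mass_near_mode = normal_cdf(v_mode + 3 * SD_V, SD_V)

print("v at the mode:", v_mode)
print("log density gap vs v = 0:", log_density_gap)
print("P(v <= mode + 3 sd):", mass_near_mode)
```

With these numbers the joint mode sits at v = -40.5, which is 13.5 standard deviations out in the marginal distribution of v: the density there is enormous, but the volume is so tiny that essentially no probability lives nearby.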

You'll also typically get an unrepresentative mode for a neural network or other ML-type model, since the mode will be a highly overfitted point.