by mattb314 1971 days ago
I think I generally agree with the majority of the comments here that burn in can serve a useful purpose (especially if you can't find a high probability density point to start from), but I also wonder: if burn-in vs no burn-in makes a large difference in your outcome, aren't you likely just not running your chain long enough? Sure, if you choose a bad starting point, your initial samples might not be representative of the overall distribution, but if a handful of non-representative points can massively impact your result, then I'm not sure how stable your result was to begin with (how do you know there isn't some other set of low-probability high-impact points that your sampler just missed through luck?). People tend to have a cognitive bias towards distributions looking pretty (e.g. not having random chains off to the side as in the article), but I'm not sure it makes a real difference.

That said, I do think burn in is a pretty reasonable way to find a good starting point if you don't have existing knowledge about the distribution. From a practical standpoint, has anyone actually seen a massive difference between runs with/without burn in? kinda curious how often it really matters
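
One way to poke at the "not running your chain long enough" question is a toy experiment. The sketch below is only an illustration under assumptions that are not from the thread: a random-walk Metropolis sampler targeting a standard normal, a deliberately bad start at x0 = 30, a fixed seed, and a 10% burn-in. The function names are made up for the example. It compares the estimated mean with and without discarding burn-in for a short chain and a long chain.

```python
import math
import random


def metropolis_standard_normal(n_steps, x0, step=1.0, seed=0):
    """Random-walk Metropolis chain targeting N(0, 1); returns all n_steps states."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        # Log acceptance ratio for a target density proportional to exp(-x^2 / 2).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


def mean_with_and_without_burn_in(n_steps, x0=30.0, burn_fraction=0.1):
    """Return (mean over the full chain, mean after dropping the burn-in part)."""
    chain = metropolis_standard_normal(n_steps, x0)
    cut = int(burn_fraction * n_steps)
    kept = chain[cut:]
    return sum(chain) / len(chain), sum(kept) / len(kept)


# The true mean is 0. Assumed setup: same bad start, two chain lengths.
short_all, short_kept = mean_with_and_without_burn_in(1_000)
long_all, long_kept = mean_with_and_without_burn_in(100_000)

print(f"short chain: no burn-in {short_all:+.3f}, with burn-in {short_kept:+.3f}")
print(f"long chain:  no burn-in {long_all:+.3f}, with burn-in {long_kept:+.3f}")
```

In this toy setup the walk in from x0 = 30 takes on the order of a hundred steps, so it dominates the short-chain average and is nearly invisible in the long one. That fits the point above: a large burn-in effect mostly signals a chain that is too short. It says nothing about harder targets where the chain can miss whole regions.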


> Sure, if you choose a bad starting point, your initial samples might not be representative of the overall distribution, but if a handful of non-representative points can massively impact your result, then I'm not sure how stable your result was to begin with (how do you know there isn't some other set of low-probability high-impact points that your sampler just missed through luck?).

You're right, and most comments I've seen over the years on the post conveniently miss that he addresses that:

> This unbiasedness argument is rubbish. If you start at x and I start at x then your MCMC run is no better than mine. If you used burn-in and I didn't, then you are entitled to woof about approximate unbiasedness and I am not. But that woof does not make your estimator any better.

My interpretation has always been this, and I think it's correct: You need a good starting point. There's no reason to think burn-in gives you a good starting point. Instead, use something that's actually intended to give a good starting point, like the mode.

For difficult problems, finding the mode may (a) be as hard as, or harder than, doing an MCMC sampling run, and (b) give you a point that is completely unrepresentative of the overall distribution.
I agree, but his argument is that in general doing a burn-in is still not going to be a substitute for good starting values, and if anything it's even easier to get a bad starting value using burn-in on a difficult problem.
If you have some nice idea of how to find a good starting value, then you should certainly use it, not just rely on burn-in.

But having used your good starting value, you should still discard some burn-in iterations. This is certainly true if you're running more than one chain, since including them all with this same starting value will bias the results (in a real, not just theoretical sense, though the magnitude of the bias will of course vary with your problem). Even if you're running just one chain, you should discard at least some burn-in (say 5%) even if you have no evidence that it is necessary, because you really don't know that your supposed good starting point is actually representative. (That is, you don't know this for difficult problems, which are the ones I'm discussing.)

I don't understand how the mode can be unrepresentative of the overall distribution. It seems like it's one of the finest representatives.
This can happen easily in Bayesian hierarchical models, where there is a hyperparameter that controls the variance of many lower-level parameters. When the variance is small, the probability density for these parameters is high (their distribution is sharply peaked); when the variance is large, the density is smaller (maybe many, many orders of magnitude smaller). So the mode will be where the variance is small, even if the data make this a much less probable region of the parameter space. (Note: the probability of a region is roughly the product of its volume and its probability density, so the total probability can be low even if the density is extremely high.)
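
A small numeric sketch of the hierarchical case, under assumptions that are mine and not from the thread: a toy model with theta_j ~ N(0, tau^2) and y_j ~ N(theta_j, 1), a flat prior on tau, and eight made-up data values. It compares the joint log density (what a mode finder climbs) with the log marginal likelihood of tau after integrating out the thetas (which reflects where the probability mass is).

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


# Made-up, fairly spread-out observations (illustrative only).
y = [-6.0, -3.0, -1.0, 0.0, 2.0, 4.0, 5.0, 7.0]


def best_theta(tau):
    """Thetas that maximize the joint density for a fixed tau (shrunken y values)."""
    shrink = tau * tau / (1.0 + tau * tau)
    return [shrink * yj for yj in y]


def log_joint(tau, theta):
    """Joint log density of (tau, theta) given y, with an assumed flat prior on tau."""
    return sum(
        log_normal_pdf(t, 0.0, tau) + log_normal_pdf(yj, t, 1.0)
        for yj, t in zip(y, theta)
    )


def log_marginal(tau):
    """Log likelihood of tau with theta integrated out: y_j ~ N(0, 1 + tau^2)."""
    sd = math.sqrt(1.0 + tau * tau)
    return sum(log_normal_pdf(yj, 0.0, sd) for yj in y)


tiny_tau, moderate_tau = 1e-6, 4.5

joint_tiny = log_joint(tiny_tau, best_theta(tiny_tau))
joint_moderate = log_joint(moderate_tau, best_theta(moderate_tau))
marginal_tiny = log_marginal(tiny_tau)
marginal_moderate = log_marginal(moderate_tau)

print(f"joint log density:       tiny tau {joint_tiny:.1f}, moderate tau {joint_moderate:.1f}")
print(f"marginal log likelihood: tiny tau {marginal_tiny:.1f}, moderate tau {marginal_moderate:.1f}")
```

In this sketch the joint density keeps growing as tau shrinks toward zero, so a mode finder heads for tiny tau, while the marginal for tau strongly prefers the moderate value because the data are spread out. Starting a chain at that "mode" would start it somewhere the posterior mass is not.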

You'll also typically get an unrepresentative mode for a neural network or other ML-type model, since the mode will be a highly-overfitted point.