For difficult problems, the mode may (a) be as hard to find as, or harder to find than, doing an MCMC sampling run, and (b) be completely unrepresentative of the overall distribution.
I agree, but his argument is that in general doing a burn-in is still not going to be a substitute for good starting values, and if anything it's even easier to get a bad starting value using burn-in on a difficult problem.
If you have some nice idea of how to find a good starting value, then you should certainly use it, not just rely on burn-in.
But having used your good starting value, you should still discard some burn-in iterations. This is certainly true if you're running more than one chain, since including the early iterations of all the chains, every one of which begins at this same starting value, will bias the results (in a real, not just theoretical sense, though the magnitude of the bias will of course vary with your problem). Even if you're running just one chain, you should discard at least some burn-in (say 5%) even if you have no evidence that it is necessary, because you really don't know that your supposed good starting point is actually representative. (That is, you don't know this for difficult problems, which are the ones I'm discussing.)
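To make the multiple-chain point concrete, here is a minimal sketch in Python. The helper names (discard_burn_in, pooled_mean) and the toy chains are made up for illustration; the numbers are not from any real sampler. Three chains all begin at the same starting value, 10.0, and then sit at 0.0, so the pooled estimate is pulled toward the shared start unless the first 5% of each chain is dropped.

```python
def discard_burn_in(chain, fraction=0.05):
    """Return the chain without its first `fraction` of iterations."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_discard = round(fraction * len(chain))
    return chain[n_discard:]


def pooled_mean(chains, fraction=0.05):
    """Mean over all chains after discarding burn-in from each one."""
    kept = [x for chain in chains for x in discard_burn_in(chain, fraction)]
    return sum(kept) / len(kept)


# Toy data (hypothetical): 3 chains of 20 iterations, all started at 10.0.
chains = [[10.0] + [0.0] * 19 for _ in range(3)]

mean_with_start = pooled_mean(chains, fraction=0.0)      # keeps the shared start
mean_without_start = pooled_mean(chains, fraction=0.05)  # drops 1 iteration per chain
print(mean_with_start, mean_without_start)
```

In this toy case the shared starting value shifts the pooled mean from 0.0 to 0.5. In a real run the size of the shift depends on how far the start is from typical values and how long the chains are.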
This can happen easily in Bayesian hierarchical models, where there is a hyperparameter that controls the variance of many lower-level parameters. When the variance is small, the probability density for these parameters is high (their distribution is sharply peaked); when the variance is large, the density is smaller (maybe many, many orders of magnitude smaller). So the mode will be where the variance is small, even if the data make this a much less probable region of the parameter space. (Note: the probability of a region is the product of its volume and its probability density, so the total probability can be low even if the density is extremely high.)
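The density-versus-probability point can be checked with a little arithmetic. The sketch below is a toy stand-in, not an actual hierarchical model: it assumes a 100-dimensional distribution made of a narrow Gaussian component (standard deviation 0.01) holding only 0.1% of the total probability and a broad one (standard deviation 1) holding the other 99.9%. All of these numbers, and the function name, are assumptions chosen for illustration. It compares how much each component contributes to the density at its own centre.

```python
import math


def log_peak_height(weight, sd, n_dims):
    """Log of weight * N(x | centre, sd^2 I) evaluated at x = centre."""
    return math.log(weight) - 0.5 * n_dims * math.log(2.0 * math.pi * sd * sd)


n_dims = 100                         # number of lower-level parameters (assumed)
narrow_weight, narrow_sd = 0.001, 0.01
broad_weight, broad_sd = 0.999, 1.0

log_ratio = (log_peak_height(narrow_weight, narrow_sd, n_dims)
             - log_peak_height(broad_weight, broad_sd, n_dims))
log10_ratio = log_ratio / math.log(10.0)

print(f"narrow peak is about 10^{log10_ratio:.0f} times higher")
print(f"but the narrow component holds only {narrow_weight:.1%} of the probability")
```

Under these assumptions the narrow peak is roughly 197 orders of magnitude higher, so any mode-finder will land there, yet a sampler that is working properly would spend only about one iteration in a thousand in that region.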
You'll also typically get an unrepresentative mode for a neural network or other ML-type model, since the mode will be a highly-overfitted point.