by dhampi
199 days ago
I don't understand the analogy. If I'm using an MCMC algorithm to sample from a probability distribution, I need to wait for my Markov chain to converge to its stationary distribution before sampling, sure. But in no way is 'a good answer' a stationary state in the LLM Markov chain. If I continue running next-token prediction, I'm not going to start looping.
So for language, when I say "Bob has three apples, Jane gives him four and Judy takes two, how many apples does Bob have?" we're actually pretty far from the part of the linguistic manifold where the correct answer is likely to be. As the chain wanders this space it gets closer, until it finally (statistically) follows the path "this answer is..." and once it's sampling from that path it's in a much more likely neighborhood of the correct answer. That is, after wandering a bit, more of the possible paths end up close to where the actual answer lies than would if we had just forced the model to choose early.
edit: Michael Betancourt has a great introduction to HMC which covers warm-up and the typical set: https://arxiv.org/pdf/1701.02434 (he has a ton more content that dives much more deeply into the specifics)
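
For anyone who wants to see the warm-up effect concretely, here is a minimal sketch. It uses plain random-walk Metropolis rather than HMC, on a 1-D standard normal target, and it says nothing about LLMs directly. The function names, the start point of 50, the step size, and the 1000-step warm-up cutoff are all arbitrary choices for illustration, not anything from the paper.

```python
import math
import random


def metropolis(log_density, start, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler; returns the full chain, warm-up included."""
    rng = random.Random(seed)
    x = start
    logp = log_density(x)
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        # 1.0 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < logp_new - logp:
            x, logp = proposal, logp_new
        chain.append(x)
    return chain


def standard_normal_logpdf(x):
    # Log density up to an additive constant, which Metropolis does not need.
    return -0.5 * x * x


# Deliberately start far from where the target's mass is.
chain = metropolis(standard_normal_logpdf, start=50.0, n_steps=5000)

# Assumed warm-up length: discard the first 1000 draws.
kept = chain[1000:]

early_mean = sum(chain[:20]) / 20    # still close to the start point
kept_mean = sum(kept) / len(kept)    # should sit near the true mean of 0

print(f"mean of first 20 draws:  {early_mean:.2f}")
print(f"mean after warm-up:      {kept_mean:.2f}")
```

The early draws are dominated by where the chain started, not by the target. Only after it has wandered into the high-probability region do the samples say anything useful, which is the sense of "forcing the model to choose early" in the analogy above.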