by pama
409 days ago
Unfortunately, the intuition and the math proofs so far suggest that autoregressive training learns the joint distribution of probabilistic token streams much better than diffusion models do or ever will. My intuitive take is that the conditional probability distribution of decoder-only autoregressive models is at just the right level of complexity for probabilistic models to learn accurately enough. Intuitively (and simplifying things at the risk of breaking rigor), diffusion (or masked) models occasionally have to issue tokens with less information, and thus higher variance, than a pure autoregressive model would, so the joint distribution, i.e. the probability of the whole sentence/answer, will be lower, and thus diffusion models will never get precise enough. Of course, during generation the sampling technique influences this simplified picture dramatically: the typical randomized sampling for next-token prediction is suboptimal and could, in principle, be beaten in some contexts by a carefully designed block diffusion sampler, though I haven't seen real examples of it yet. But the key idea of the above scribbles still holds: autoregressive models will always be better (or at least equal) probabilistic models of sequential data than diffusion models. So diffusion models mostly offer a tradeoff of performance vs. quality. Sometimes there is a lot of room for that tradeoff in practice.
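To make the "less information, lower joint probability" intuition concrete, here is a toy sketch. It is not a proof and not taken from any paper, and it models only a single parallel decoding step rather than a trained diffusion model. The two-token joint distribution and all function names are made up for illustration. The chain-rule (autoregressive) factorization recovers the joint exactly, while emitting both tokens at once from their marginals loses expected log-likelihood equal to the mutual information between the tokens.

```python
import math

# Hypothetical toy joint distribution over two correlated binary tokens (x1, x2).
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}


def marginal(dist, axis):
    """Marginal distribution of the token at position `axis`."""
    out = {}
    for tokens, p in dist.items():
        out[tokens[axis]] = out.get(tokens[axis], 0.0) + p
    return out


p1 = marginal(joint, 0)
p2 = marginal(joint, 1)


def ar_prob(x1, x2):
    """Autoregressive factorization: p(x1) * p(x2 | x1), exact by the chain rule."""
    p2_given_x1 = joint[(x1, x2)] / p1[x1]
    return p1[x1] * p2_given_x1


def parallel_prob(x1, x2):
    """Both tokens emitted at once, each from its own marginal (no conditioning)."""
    return p1[x1] * p2[x2]


def expected_log_prob(model):
    """Expected log-probability the model assigns to sequences drawn from `joint`."""
    return sum(p * math.log(model(*tokens)) for tokens, p in joint.items())


ar_ll = expected_log_prob(ar_prob)
parallel_ll = expected_log_prob(parallel_prob)

# The gap equals the mutual information I(x1; x2) in nats, which is never negative.
gap = ar_ll - parallel_ll
mutual_info = sum(
    p * math.log(p / (p1[a] * p2[b])) for (a, b), p in joint.items()
)

print(f"autoregressive: {ar_ll:.4f} nats, parallel: {parallel_ll:.4f} nats, gap: {gap:.4f}")
```

The gap shrinks to zero when the tokens are independent, which fits the remark that there is sometimes a lot of room for the tradeoff in practice.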
|
Could you point me to some literature, especially regarding mathematical proofs of your intuition?
I’d like to recalibrate my priors to align better with current research results.