by hansvm 469 days ago
> By design, AR models lack planning and reasoning capabilities. If you generate one word at a time, you don’t really have a general idea of where you’re heading.

I have one minor quibble here: the limitation described isn't a criticism of AR models in general (whose outputs are only "backward-looking" with respect to their inputs), but just of the subset of AR models in popular use. An AR model is fully capable of building up a large internal state and doing many computations (even many fully-connected diffusion steps) before generating the first output token.

That quibble wouldn't be worth mentioning unless AR models had some sort of advantage, but they do, and it's incredibly important. The AR factorization of the joint probability into per-token conditional probabilities lets you treat the loss as a sum of contributions from each output token: you can blindly shove whatever data you want into the thing, add up all the errors, and backpropagate, all while guaranteeing that the distribution you're learning is the same distribution as your training data.
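To make the additive-loss point concrete, here is a minimal sketch under toy assumptions: a made-up dataset of two-character strings and a tabular "model" that just counts next tokens given a prefix (the names `fit_ar`, `cond_prob`, and `seq_nll` are hypothetical, not from any library). The per-token negative log-likelihoods add up to the sequence loss, and at the maximum-likelihood optimum the product of the conditionals reproduces the empirical distribution of whole sequences.

```python
import math
from collections import Counter

# Hypothetical toy training set: sequences over the vocabulary {a, b}.
data = ["aa", "ab", "ab", "bb", "ba", "ab", "aa", "bb"]

def fit_ar(seqs):
    """Tabular AR model: count each next token given its prefix.
    For a count table this is exactly the maximum-likelihood fit."""
    counts = {}
    for s in seqs:
        for t in range(len(s)):
            counts.setdefault(s[:t], Counter())[s[t]] += 1
    return counts

def cond_prob(counts, prefix, tok):
    """p(tok | prefix) from the fitted counts."""
    c = counts[prefix]
    return c[tok] / sum(c.values())

def seq_nll(counts, s):
    """Sequence loss = sum of per-token losses (the AR factorization)."""
    return -sum(math.log(cond_prob(counts, s[:t], s[t])) for t in range(len(s)))

counts = fit_ar(data)
empirical = Counter(data)

# The model's joint probability for each sequence matches its
# frequency in the training data.
for s, c in empirical.items():
    model_p = math.exp(-seq_nll(counts, s))
    assert abs(model_p - c / len(data)) < 1e-12
```

A real model replaces the count table with a network and only approaches this optimum, but the objective being summed has the same shape.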

Unless you're careful, via some mechanism like AR, the distribution you learn can have almost nothing to do with the distribution you're training on. A common failure mode is a tendency to predict "average-looking" sub-tiles in a composite image, and to only predict images that can be composed out of those smaller, average-looking sub-tiles. As an example, imagine that (with low enough model capacity) you had a model generating people and everyone was vaguely 5'10", ambiguously gendered, and a bit tan. Contrast that with the same model trained using AR: with insufficient capacity you'd expect the outputs to be bad in other ways, but to at least have a mix of colors, heights, and genders. Increasing capacity can help, but why bother when something like AR solves it by definition?
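A toy illustration of the averaging failure mode, with loud assumptions: the heights below are invented, and the "non-AR" side is stood in for by a plain squared-error point estimate, which is only one simple way such averaging can arise and not necessarily the mechanism in any particular model. The likelihood-trained side is a categorical distribution, whose optimum is the empirical frequencies.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical bimodal training data (heights in cm): nobody is 175.
heights = [160] * 50 + [190] * 50

# Squared-error objective: the best single prediction is the mean,
# an "average-looking" value that never occurs in the data.
mse_prediction = sum(heights) / len(heights)

# Likelihood objective over a categorical distribution: the optimum is
# the empirical frequency of each value, so samples keep both modes.
freqs = Counter(heights)
values = list(freqs)
weights = [freqs[v] / len(heights) for v in values]
samples = random.choices(values, weights=weights, k=1000)

print(mse_prediction)         # 175.0
print(sorted(set(samples)))   # only values that appear in the data
```

The point estimate collapses to a person who doesn't exist in the training set, while sampling from the learned distribution preserves the mix.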