maybe i understand this a little differently, the argument i am most familiar with is this one from lecun, where the error accumulation in the prediction is the concern with autoregression https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMR...
The error accumulation thing is basically without any ground as regressive models correct what they are saying in the process of emitting tokens (trivial to test yourself: force a given continuation in the prompt and the LLMs will not follow at all). LeCun provided an incredible amount of wrong claims about LLMs, many of which he now no longer accepts: like the stochastic parrot claim. Now the idea that there is just a statistical relationship in the next token prediction is considered laughable, but even when it was formulated there were obvious empirical hints.
i think the opposite, the error accumulation thing is basically the daily experience of using LLMs.
As for the premise that models cant self correct that's not the argument i've ever seen, transformers have global attention across the context window. It's that their prediction abilities are increasingly poor as generation goes on. Is anyone having a different experience than that?
Everyone doing some form of "prompt engineering" whether with optimized ML tuning, whether with a human in the loop, or some kind of agentic fine tuning step, runs into perplexity errors that get worse with longer contexts in my opinion.
There's some "sweet spot" for how long of a prompt to use for many use cases, for example. It's clear to me that less is more a lot of the time
Now will diffusion fare significantly better on error is another question. Intuition would guide me to think more flexiblity with token-rewriting should enable much greater error correction capabilities. Ultimately as different approaches come online we'll get PPL comparables and the data will speak for itself
I'm not talking about the fine tuning that make them side with the user even when they are wrong (anyway, this is less and less common now compared to the past, but anyway it's a different effect). I'm referring if in the template you make the assistant reply starting with wrong words / directions, and the LLM finds a way to say what it really meant saying "wait, actually I was wrong" or other sentences that allow it to avoid following the line.