Hacker News new | ask | show | jobs
by psb217 637 days ago
"The difference between LLMs and other kinds of predictive models, or humans, is that those kinds of systems do not produce their output one token at a time, but all in one go, so their error basically stays constant." -- This is a big, unproven assumption. Any non-autoregressive model can be trivially converted to an autoregressive model by: (i) generating a full output sequence, (ii) removing all tokens except the first one, (iii) generating a full-1 output sequence conditioned on the first token. This wraps the non-autoregressive model in an "MPC loop", thereby converting it to an autoregressive model where per-token error is no greater than that of the wrapped non-AR model. The explicit MPC planning behavior might reduce error per token compared to current naive applications of AR transformers, but the MPC-wrappped model is still an AR model, so the problem is not AR per se.

LeCun's argument has some decent points, eg, allocating compute per token based solely on location within the sequence (due to increasing cost of attention ops for later locations) is indeed silly. However, the points about AR being unavoidably flawed due to exponential divergence from the true manifold are wrong and lazy. They're not wrong because AR models don't diverge, they're wrong because this sort of divergence is also present in other models.

2 comments

The loop itself is claimed to be the problem. It doesn't matter whether you use an AR or non-AR model. They both have a certain error probability that gets amplified in each iteration.
The per token error of the non-AR model wrapped with MPC is no higher than the per token error of the non-AR model without MPC. Likelihood of the entire sequence being off the true data manifold is just one minus the product of the per token errors, whether or not you're running with the MPC loop. Ie, wrapping the non-AR model in an MPC loop and thereby converting it to an AR model (with a built-in planning mechanism) doesn't increase its probability of going off track.

Per token error compounding over sequence length happens whether or not the model's autoregressive. The way in which per token errors correlate across a sequence might be more favorable wrt probability of producing bad sequences if you incorporate some explicit planning mechanism -- like the non-AR model wrapped in an MPC loop, but that's a more subtle argument than LeCun makes.

Yes. Also "other kinds of predictive models" in my comment refers to models other than generative language models, e.g. image classifiers or regression models etc. Those don't generate tokens, they output labels and the error of the labeling is constant (well, within error bounds). This was in response to OP's comment about "all prediction machines that make errors."
Could the argument be rescued by some additional assumptions?

I agree with, and have previously also stated, the point you make there about “any non-auto-regressive model can be converted into an equivalent auto-regressive model by […]”, but, if one imposes additional restrictions on e.g. computation time, or something like that, I think that construction no longer works.

Well, of course there are some additional assumptions which would rescue the argument, so I guess my real question is whether there’s some combination of extra assumptions which both rescue the argument, and actually result in it being interesting.

If one makes the assumptions that there is a positive common lower bound on the probability of each token being incorrect assuming each previous token is correct, and that if any token is incorrect, then the whole generated text is incorrect, then of course the argument goes through, though the assumption doesn’t necessarily seem very likely.

Then, if we apply the construction, you mentioned to a text generation process with a low enough probability of error, then by the contrapositive, there cannot be an especially high common lower bound on the probability of error per token.

[“edit” prior to posting: I notice that at this point I started using symbols as if I was going to start doing actual algebraic manipulations, but did not actually do any algebraic manipulations which would justify the use of said symbols. I think what I wrote below would be clearer if I had just used words. Unfortunately I don’t want to take the time to rewrite it. I apologize for introducing formalisms without having a good reason to do so.]

If we have the assumption that there is a procedure with error rate < epsilon(x) for generating an entire text response of length l(x), and which can be computed within time t(x), the construction gives an autoregressive method which has error rate less than epsilon(x) for the entire text, and doesn’t have an error rate higher than epsilon’(x) for all of the tokens, and runs in time t’(x) per token (err… I guess it should actually vary between the tokens in the generated string… depends on details I guess), where epsilon’(x) and t’(x) can be computed based on epsilon(x) and t(x) and based on how the construction works,

and epsilon’(x) will be much smaller than epsilon(x), while t’(x) l(x) >> t(x) (at least, assuming l(x) is somewhat large).

So, that particular construction does not preclude the possibility that there is no algorithm that works auto-regressively and which both has an error rate(for overall generated text) as low as [the error rate for some non-auto-regressive model that runs quickly enough], and which runs quickly enough .

If there are cryptographically secure families of hashing functions (in the sense of, asymptotically in the size of the hash length, while the hash can be computed in polynomial time, finding preimages or collisions cannot be done in polynomial time) it seems that there should probably be functions from strings to strings which can be computed in time bounded above by some polynomial, but which can’t be computed autoregressively in time bounded above by a polynomial of the same degree.

(So like, maybe it can be computed in time 5n^4 when not autoregressive, but needs at least 2n^5 time to do auto regressively)

(I’m imagining something like, “compute a string of the form ‘hash(y), y’ where y is the result of some computation done on the input which takes a polynomial amount of time to compute from the input. So, the easiest way to compute this would be to compute y and the compute hash(y). So, to do this auto-regressively, it would need to compute y again for each token in the hash.)

Of course, a single factor of n might not be that compelling, and appealing to strong hashing functions is probably trying to kill a fly with a sledgehammer(probably there are arguments that work as well without assuming this), but it’s what came to mind.

Perhaps one could do something like this to show that for some problems, any auto-regressive solution that has certain runtime bounds, will have some positive lower bound on the error rate per token?

I think it would be hard to make a solid argument that AR or non-AR is strictly better wrt full sequence error rates, whether or not we place constraints on compute, memory, etc. I'd guess that there's some intrinsic form of complexity inherent to any particular distribution of sequences which requires spending at least some amount of compute to achieve sequence generation error less than some epsilon. I'd also guess that AR and non-AR models could both achieve this bound in principle, though maybe it's practically harder with one or the other. It would be interesting to formally characterize this sort of complexity, but that's above my analytical pay grade.

The hash function example is interesting. I think the model could compute y prior to outputting any tokens and then output the `hash(y), y' sequence deterministically. In architectures like transformers, all the compute in earlier steps can be reused in later steps via attention, so it wouldn't be necessary to recompute y at each step as long as the model commits to a given y up front before starting to generate hash(y).

Ah, yeah, I guess that probably is true of transformers in practice. I was thinking about something which strictly takes in a sequence of tokens and outputs a (possibly 1-hot) probability distribution over all possible next tokens. Such a thing running autoregressively would have to recompute y each time. But, if intermediate computations are cached, as with transformers in practice, then this isn’t necessary.