Hacker News new | ask | show | jobs
by sailingparrot 236 days ago
> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words

Autoregressive LLMs don't do that either actually. Sure with one forward pass you only get one token at a time, but looking at what is happening in the latent space there are clear signs of long term planning and reasoning that go beyond just the next token.

So I don't think it's necessarily more or less similar to us than diffusion, we do say one word at a time sequentially, even if we have the bigger picture in mind.

5 comments

To take a simple example, let’s say we ask an autoregressive model a yes/no factual question like “is 1+1=2?”. Then, we force the LLM to start with the wrong answer “No, “ and continue decoding.

An autoregressive model can’t edit the past. If it happens to sample the wrong first token (or we force it to in this case), there’s no going back. Of course there can be many more complicated lines of thinking as well where backtracking would be nice.

“Reasoning” LLMs tack this on with reasoning tokens. But the issue with this is that the LLM has to attend to every incorrect, irrelevant line of thinking which is at a minimum a waste and likely confusing.

As an analogy, in HN I don’t need to attend to every comment under a post in order to generate my next word. I probably just care about the current thread from my comment up to the OP. Of course a model could learn that relationship but that’s a huge waste of compute.

Text diffusion solves the whole problem entirely by allowing the model to simply revise the “no” to a “yes”. Very simple.

That is precisely what autoregressive means. Perhaps you meant to write that modern LLMs are not strictly autoregressive?
I think they are distinguishing the mechanical process of generation from the way the idea exists. It’s the same as how a person can literally only speak one word at a time but the ideas might be nonlinear.
Indeed what I meant. The LLM isn’t a blank slate at the beginning of each new token during autoregression as the kv cache is there.
If so they are wrong. :) Autoregressive just means that the probability of the next token is just a function of the already seen/emitted tokens. Any "ideas that may exist" are entirely embedded in this sequence.
> entirely embedded in this sequence.

Obviously wrong, as otherwise every model would predict exactly the same thing, it would not even be predicting anymore, simply decoding.

The sequence is not enough to reproduce the exact output, you also need the weights.

And the way the model work is by attending to its own internal state (weights*input) and refining it, both across the depth (layer) dimension and across the time (tokens) dimension.

The fact that you can get the model to give you the exact same output by fixing a few seeds, is only a consequence of the process being markovian, and is orthogonal to the fact that at each token position the model is “thinking” about a longer horizon than the present token and is able to reuse that representation at later time steps

Well, that an autoregressive model has parameters does not mean it is not autoregressive. LLMs are not Markovian.
At no point have I argued that LLMs aren’t autoregressive, I am merely talking about LLMs ability to reason across time steps, so it seems we are talking past each other which won’t lead anywhere.

And yes, LLM can be studied under the lens of Markov processes: https://arxiv.org/pdf/2410.02724

Have a good day

If a process is necessary for performing a task, (sufficiently-large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it anything resembling efficiently, or that a different architecture / algorithm wouldn't produce a better result.
I’m not arguing about efficiency though ? Simply saying next token predictors cannot be thought of as actually just thinking about the next token with no long term plan.
They rebuild the "long term plan" anew for every token: there's no guarantee that the reconstructed plan will remain similar between tokens. That's not how planning normally works. (You can find something like this every time there's this kind of gross inefficiency, which is why I gave the general principle.)
> They rebuild the "long term plan" anew for every token

Well no, there is attention in the LLM which allows it to look back at it's "internal thought" during the previous tokens.

Token T at layer L, can attend to a projection of the hidden states of all tokens < T at L. So its definitely not starting anew at every token and is able to iterate on an existing plan.

Its not a perfect mechanism for sure, and there is work to make LLMs able to carry more information forward (e.g. feedback transformers), but they can definitely do some of that today.

This isn't the same as planning. Consider what happens when tokens from another source are appended.
I don't follow how this relates to what we are discussing. Autoregressive LLMs are able to plan within a single forward pass and are able to look back at their previous reasoning and do not start anew at each token like you said.

If you append tokens from another source, like in a turn base conversation, then the LLM will process all the new appended tokens in parallel while still being able to look back at it's previous internal state (and thus past reasoning/planning in latent space) from the already processed tokens, then will adjust the plan based on the new information.

What happens to you as a human if you come up with a plan with limited information and new information is provided to you?

Actually, due to using causal (masked) attention, new tokens appended to the input don't have any effect on what's calculated internally (the "plan") at earlier positions in the input, and a modern LLM therefore uses a KV cache rather than recalculating at those earlier positions.

In other words, the "recalculated" plan will be exactly the same as before, just extended with new planning at the position of each newly appended token.

You can violate the plan in the sampler by making an "unreasonable" choice of next token to sample (eg by raising the temperature.) So if it does stick to the same plan, it's not going to be a very good one.
Yeah.

Karpathy recently referred to LLMs having more "working memory" than a human, apparently referring to these unchanging internal activations as "memory", but it's an odd sort of "working memory" if you can't actually update it to reflect progress on what you are working on, or update per new information (new unexpected token having been sampled).

Right, and this is what "reasoning LLMs" work around by having explicitly labelly "reasoning tokens".

This lets them "save" the plan between tokens, so when regenerating the new token it is following the plan.

It also doesn’t mean they’re doing it inefficiently.
I read this to mean “just because the process doesn’t match the problem, that doesn’t mean it’s inefficient”. But I think it does mean that. I expect we intuitively know that data structures which match the structure of a problem are more efficient than those that don’t. I think the same thing applies here.

I realize my argument is hand wavey, i haven’t defined “efficient“ (in space? Time? Energy?), and there are other shortcomings, but I feel this is “good enough” to be convincing

Example: a list of (key, value) pairs is a perfectly valid way to implement a map, and suffices. However, a more complicated tree structure, perhaps with hashed keys, is usually way more efficient, which is increasingly-noticeable as the number of pairs stored in the map grows large.
I suppose there’s something in what you’re saying, it’s just that’s it’s sorta vague and hard to parse for me. It also depends on the higher order problem space, for example: is it efficient if the problem is defined by “make something that can adapt to a problem space and solve it without manual engineering” rather than “make something with a long lead up time where you understand the problem space in advance and therefore have time to optimise the engine”. In the former, the neural network would indeed count as solving this efficiently, because of the given definition of the goal.
You're right that there is long-term planning going on, but that doesn't contradict the fact that an autoregressive LLM does, in fact, literally generate words one at a time based on previously spoken words. Planning and action are different things.
There is some long term planning going on, but bad luck when sampling the next token can take the process out of rails, so it's not just an implementation detail.