
by LelouBil 283 days ago

Not the original person you are replying to, but I wanted to add:

Yes, they can plan within a single forward pass like you said, but I still think they "start anew at each token" because they have no state/memory that is not the output.

I guess this comes down to differing interpretations of what "start anew" means, but personally I would agree that having no internal state and simply looking back at its previous output to form a new token is "starting anew".

But I'm also not well informed about the topic so happy to be corrected.


sailingparrot 283 days ago

But you are missing the causal attention from your analysis. The output is not the only thing that is preserved, there is also the KV-cache.

At token 1, the model goes through, say, 28 transformer blocks; for each of those blocks we save two projections of the hidden state (the keys and the values) in a cache.

At token 2, on top of seeing the new token, the model is now also able, in each of those 28 blocks, to look at the previously saved states from token 1.

At token 3, it can see the states from tokens 2 and 1, and so on.
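The token-by-token description above can be sketched in a few lines of NumPy. This is a toy under stated assumptions, not a real model: single-head attention only (no MLP, no layer norm), random weights standing in for trained ones, and a made-up hidden size `D`. The names `decode_step`, `weights` and `cache` are hypothetical; only the count of 28 blocks comes from the comment.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 8, 28  # toy hidden size (assumption); 28 blocks as in the comment

# One (Wq, Wk, Wv) triple per block: random stand-ins for trained weights.
weights = [tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
           for _ in range(N_LAYERS)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x, cache):
    """Run one token's hidden state through every block, updating cache in place."""
    for layer, (Wq, Wk, Wv) in enumerate(weights):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # Save the two projections of this token's hidden state for this block.
        cache[layer]["k"].append(k)
        cache[layer]["v"].append(v)
        K = np.stack(cache[layer]["k"])  # (tokens_so_far, D)
        V = np.stack(cache[layer]["v"])
        attn = softmax(K @ q / np.sqrt(D))  # weights over current and past tokens
        x = x + attn @ V  # residual connection
    return x

cache = [{"k": [], "v": []} for _ in range(N_LAYERS)]
embeddings = rng.normal(size=(3, D))  # three made-up token embeddings
for t, emb in enumerate(embeddings, start=1):
    decode_step(emb, cache)
    print(f"after token {t}: {len(cache[0]['k'])} cached key/value pairs per block")
```

Each step only computes projections for the newest token; everything it reads about earlier tokens comes out of `cache`.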

However, I still agree that it is not a perfect information-passing mechanism because of how those models are trained (and something like a feedback transformer would be better), but information is still very much being passed from earlier tokens to later ones.


LelouBil 283 days ago

Like another commenter said, isn't the KV cache a performance optimization to avoid redoing work that was already done? Or does it fundamentally alter the output of the LLM, and so preserve state that is not present in the output of the LLM?


sailingparrot 283 days ago

Yes, it's "just" an optimization technique, in the sense that you could not have it and end up with the same result (given the same input sequence), just much slower.

Conceptually, what matters is not the kv-cache but the attention. But IMHO thinking about how the model behaves during inference, when it outputs one token at a time and does attention on the kv-cache, is much easier to grok than training/prefilling, where the kv-cache is absent and everything happens in parallel (although they are mathematically equivalent).
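The claimed equivalence can be checked numerically on a single attention layer. A minimal sketch, assuming one head, random weights and toy sizes; `attend_parallel` and `attend_cached` are hypothetical names for the prefill-style and decode-style computations.

```python
import numpy as np

rng = np.random.default_rng(1)
D, T = 4, 5  # toy hidden size and sequence length (assumptions)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
X = rng.normal(size=(T, D))  # inputs to one attention layer, one row per token

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_parallel(X):
    """Prefill style: all tokens at once, with a causal mask and no cache."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(D)
    future = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[future] = -np.inf  # a token may not attend to later tokens
    return softmax(scores) @ V

def attend_cached(X):
    """Decode style: one token at a time, reusing cached keys and values."""
    k_cache, v_cache, out = [], [], []
    for x in X:
        k_cache.append(x @ Wk)
        v_cache.append(x @ Wv)
        scores = np.stack(k_cache) @ (x @ Wq) / np.sqrt(D)
        out.append(softmax(scores) @ np.stack(v_cache))
    return np.stack(out)

same = np.allclose(attend_parallel(X), attend_cached(X))
print(same)
```

The cached version never recomputes a key or value for an old token, yet it produces the same outputs as the masked parallel version, which is the sense in which the cache is "just" an optimization.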

The important part of my point is that when the model is processing token N, it can check its past internal state from tokens 1,...,N-1, and thus "see" its previous plan and reasoning and iterate on it, rather than just repeating everything from scratch in each token's hidden state (with a caveat, explained at the end).

token_1 ──▶ h₁ᴸ ────────┐

token_2 ──▶ h₂ᴸ ──attn──┼──▶ h₃ᴸ (refines reasoning)

token_3 ──▶ h₃ᴸ ──attn──┼──▶ h₄ᴸ (refines further)

And the kv-cache makes this persistent across time, so the entire system (LLM+cache) becomes effectively able to save its state, and iterate upon it at each token, and not have to start from scratch every time.

But ultimately it's a Markov chain, so again, mathematically, yes, you could just redo the full computation every time and end up in the same place.

Caveat: Because token N at layer L can attend to all other tokens <N but only at layer L, it only gets to see how the reasoning stood at that depth, not how it stood after a full pass, so it's not a perfect information-passing mechanism, and is more pyramidal than a straight line. That is why I referenced feedback transformers in another message. But the principle still applies that information is passing through time steps.
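The caveat can be made concrete by tracing which hidden states can feed into a given one through attention alone. A rough sketch under assumptions: `h[token][layer]` is the output of block `layer` (layer 0 is the embedding), block `layer` builds its keys and values from the outputs of block `layer - 1`, and the sampled output token is ignored (that is the only other channel through which a final-layer state reaches later steps). The helper names are hypothetical.

```python
def direct_inputs(token, layer):
    """Hidden states read when computing h[token][layer] (layer 0 = embedding)."""
    if layer == 0:
        return set()
    # Block `layer` attends over the outputs of block `layer - 1`
    # for the current token and every earlier token.
    return {(t, layer - 1) for t in range(token + 1)}

def ancestors(token, layer):
    """Every hidden state that can influence h[token][layer] via attention."""
    seen, stack = set(), [(token, layer)]
    while stack:
        for dep in direct_inputs(*stack.pop()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

N_LAYERS = 4  # toy depth (assumption)
reach = ancestors(token=2, layer=N_LAYERS)
print((0, N_LAYERS - 1) in reach)  # earlier token, one layer down: True
print((0, N_LAYERS) in reach)      # earlier token's final-layer state: False
```

Going back one token never lets you go up a layer, so the set of visible states narrows with depth. A feedback-style design would instead let later tokens read the top-layer state of earlier ones.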


nl 283 days ago

Worth noting here for others following that a single forward pass is what generates a single token.

It's correct to state that the LLM starts anew for each token.

The workaround for this is to pass the existing plan back into it as part of the context.


sailingparrot 283 days ago

You are forgetting about attention on the kv-cache, which is the mechanism that allows the LLM to not start anew every time.


nl 283 days ago

I mean sure, but that is a performance optimization that doesn't really change what is going on.

It's still recalculating; it's just that the intermediate steps are cached.


sailingparrot 283 days ago

Isn't the ability to store past reasoning in an external system to avoid having to do the computation all over again precisely what a memory is though?

But mathematically, KV-caching is equivalent to redoing the prefill at every token, sure. The important part of my message was the attention.

A plan/reasoning made during the forward pass of token 0 can be looked at by subsequent (or parallel if you don’t want to use the cache) passes of token 1,…,n. So you cannot consider token n to be starting from scratch in terms of reasoning/planning as it can reuse what has already been planned in previous tokens.

If you think about inference with KV-caching, even though you are right that mathematically it's just an optimization, it makes this behavior much easier to reason about: the kv-cache is a store of past internal states that the model can attend to for subsequent tokens, which allows those subsequent tokens' hidden states to be more than just a repetition of what the model already reasoned about in the past.
