Hacker News
by versteegen 773 days ago
> How does it know what the next street or neighbourhood it should traverse in each step without a pathfinding algo?

Because Transformers are 'AI-complete'. Much is made of (decoder-only) transformers being next-token predictors, which misses the fact that large transformers can "think" before they speak: there are many layers between input and output. They can form a primitive high-level plan by a certain layer of a certain token, such as the last input token of the prompt (e.g. go from A to B via approximate midpoint C), and then refer back to it on every following token while expanding upon it with details (A to C via D). Their working memory grows with the number of input+output tokens, and with each additional layer they can elaborate details of an earlier representation such as a 'plan'.

However, the number of sequential steps of any internal computation (one not 'saved' as an output token) is limited by the number of layers. This limit can be worked around with chain-of-thought, where intermediate results are written out as tokens and read back in on later steps, which is why I call them AI-complete.

I write this all hypothetically, not based on mechanistic interpretability experiments.


I like your interpretation, but how would they refer back to a plan if it isn’t stored in the input/output? Wouldn’t this be lost/recalculated with each token?
The internal state at layer M of token N is available at every following token > N and layer > M via attention heads. It is transformed by a matrix on the way, but it's a very direct lookup mechanism. The state after the final attention layer is not addressable in this way, but it immediately becomes the output token, which is of course accessible.
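
To make the lookup concrete, here is a minimal single-head sketch in plain Python. It is only an illustration: the function names, the tiny 2-dimensional states and the identity weight matrices are all made up, and real models add multiple heads, residual connections, MLPs and normalisation on top of this.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def causal_attention(states, w_q, w_k, w_v):
    """One attention head over the layer-M states of all tokens.

    Token n only reads the (matrix-transformed) states of tokens 0..n.
    """
    queries = [matvec(w_q, s) for s in states]
    keys = [matvec(w_k, s) for s in states]
    values = [matvec(w_v, s) for s in states]
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for n, query in enumerate(queries):
        visible = range(n + 1)  # causal mask: no looking at later tokens
        scores = [sum(q * k for q, k in zip(query, keys[j])) / scale for j in visible]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(weights[j] * values[j][d] for j in visible) / total
            for d in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy data: three tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
layer_m = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_m_plus_1 = causal_attention(layer_m, identity, identity, identity)
```

Changing the state of the last token leaves the outputs for the first two tokens untouched, while changing the state of the first token changes what every later token reads.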

Note also that sequential computations such as loops translate nicely to parallel ones, e.g. k layers can search the paths of length k in a graph, if each token represents one node. But since each token can only look backwards, unless you're searching a DAG you'd also have to feed in the graph multiple times so the nodes can see each other. Hmm... that might be a useful LLM prompting technique.
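
A rough sketch of that idea, again hypothetical and not a claim about what a trained model actually does: one token per node, each "layer" updates every position in parallel from the previous layer's state, and a position may only read earlier positions. The `copies` parameter stands in for feeding the graph in more than once.

```python
def reachable(edges, source, num_nodes, num_layers, copies=1):
    """Which nodes are known to be reachable from `source` after `num_layers` layers.

    Position p (holding node v) may read any earlier position q (holding node u)
    if u == v (carry the fact forward) or if there is an edge u -> v.
    """
    tokens = list(range(num_nodes)) * copies  # node id at each position
    edge_set = set(edges)
    state = [node == source for node in tokens]
    for _ in range(num_layers):
        new_state = list(state)
        # Every position is updated from the previous layer only, so this
        # loop over positions could run in parallel.
        for p, v in enumerate(tokens):
            for q in range(p):  # causal: only earlier positions are visible
                u = tokens[q]
                if state[q] and (u == v or (u, v) in edge_set):
                    new_state[p] = True
        state = new_state
    return state[-num_nodes:]  # read the answer from the last copy of the graph

# The path 0 -> 2 -> 1 -> 3 has length 3, but the edge 2 -> 1 points "backwards"
# in token order, so a single copy of the graph cannot follow it.
edges = [(0, 2), (2, 1), (1, 3)]
one_copy = reachable(edges, source=0, num_nodes=4, num_layers=3, copies=1)
two_copies = reachable(edges, source=0, num_nodes=4, num_layers=3, copies=2)
```

With one copy only nodes 0 and 2 are found; with two copies and three layers all four nodes are, because the second copy of node 1 can look back at the first copy of node 2.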

But is this lookup mechanism available from one token prediction to the next? I’ve heard conflicting things, with others saying that transformers are stateless and therefore don’t share this information across prediction steps. I might be misunderstanding something fundamental.
Yes, attention (in transformer decoders) looks backwards to internal state at previous tokens. (In transformer encoders like BERT it can also look forwards.) When they said "stateless" I think they meant that you can recompute the state from the tokens, so the state can be discarded at any time: the internal state is entirely deterministic; it's only the selection of output tokens that involves random sampling. Another critical feature of transformers is that you can compute the state at layer N for all tokens in parallel, because it depends only on layer N-1 for the current and all previous tokens, not on layer N for the previous token as in LSTMs or typical RNNs. The whole point of the transformer architecture is to allow that parallel compute, at the cost of directly depending on every previous token rather than just the last.
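
Here is a toy sketch of that dependency structure. `toy_layer` is an invented integer stand-in for attention plus MLP (it is not a real transformer layer); the only property it shares with one is that position t at layer N reads layer N-1 at positions 0..t.

```python
def toy_layer(prev_layer, t):
    """Stand-in for attention + MLP: position t reads layer N-1 at positions 0..t only."""
    return (3 * prev_layer[t] + sum(prev_layer[: t + 1])) % 101

def forward(tokens, num_layers):
    """Return the state of every layer for every token, computed layer by layer."""
    layers = [list(tokens)]
    for _ in range(num_layers):
        prev = layers[-1]
        # Each entry depends only on `prev`, never on its neighbours in the
        # new layer, so all positions could be computed in parallel.
        layers.append([toy_layer(prev, t) for t in range(len(prev))])
    return layers

prompt = [5, 17, 42]  # hypothetical token ids
prefix_only = forward(prompt, num_layers=4)
with_more_tokens = forward(prompt + [8, 23], num_layers=4)

# The state at the prompt positions is a pure function of the prompt tokens:
# appending tokens does not change it, so it can be discarded and rebuilt.
prefix_unchanged = all(
    layer[: len(prompt)] == prefix_layer
    for layer, prefix_layer in zip(with_more_tokens, prefix_only)
)
```

In an RNN the inner loop would instead need the new state of position t-1 before it could compute position t, which is what prevents this kind of parallelism.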

So if you wished, you could implement a transformer by recomputing everything on every token. That would be incredibly inefficient. However, if you're continuing a conversation with an LLM, you would likely recompute all the state for all tokens on each new user input, because the alternative is to store all that state in memory until the user gets back to you again a minute later. If you have too many simultaneous users you won't have enough VRAM for that. (In some cases moving it out of VRAM temporarily might be practical.)
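
The trade-off can be sketched with the same kind of toy model. Everything here is hypothetical: `toy_layer` is an invented stand-in for a real layer, and the cache holds raw per-layer states, whereas real implementations typically cache the key and value projections instead.

```python
def toy_layer(prev_layer, t):
    """Stand-in for attention + MLP: position t reads layer N-1 at positions 0..t only."""
    return (3 * prev_layer[t] + sum(prev_layer[: t + 1])) % 101

def last_state_recomputed(tokens, num_layers):
    """Stateless: rebuild every layer for every token, return (final state, layer calls)."""
    layer = list(tokens)
    calls = 0
    for _ in range(num_layers):
        layer = [toy_layer(layer, t) for t in range(len(layer))]
        calls += len(layer)
    return layer[-1], calls

def last_state_cached(cache, token):
    """Stateful: cache[level] keeps that level's state for all earlier tokens."""
    cache[0].append(token)
    t = len(cache[0]) - 1
    for level in range(1, len(cache)):
        # Only the new position is computed; older positions are read from the cache.
        cache[level].append(toy_layer(cache[level - 1], t))
    return cache[-1][-1]

tokens = [5, 17, 42, 8, 23]  # hypothetical token ids
num_layers = 4

cache = [[] for _ in range(num_layers + 1)]
cached_outputs = [last_state_cached(cache, tok) for tok in tokens]

recomputed = [last_state_recomputed(tokens[: n + 1], num_layers) for n in range(len(tokens))]
recomputed_outputs = [state for state, _ in recomputed]

recompute_calls = sum(calls for _, calls in recomputed)  # grows quadratically with length
cached_calls = len(tokens) * num_layers                  # grows linearly with length
```

Both routes give identical outputs. For these five tokens the stateless version makes 60 layer evaluations against 20 for the cached one, but the cached one has to keep `(num_layers + 1) * len(tokens)` values around between steps, which is the memory cost described above.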