
Yes, attention (in transformer decoders) looks backwards to internal state at previous tokens. (In transformer encoders, as in BERT, it can also look forwards.) When they said "stateless" I think they meant that you can recompute the state from the tokens, so the state can be discarded at any time: the internal state is entirely deterministic, and it's only the selection of output tokens that involves random sampling. Another critical feature of transformers is that you can compute the state at layer N for all tokens in parallel, because it depends only on layer N-1 for the current and all previous tokens, not on layer N for the previous token as in LSTMs or typical RNNs. The whole point of the transformer architecture is to allow that parallel compute, at the cost of depending directly on every previous token rather than just the last one.
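To make the dependency structure concrete, here is a toy sketch in plain Python. It is not how any real model is implemented: it assumes a single attention head, identity Q/K/V projections, and no MLP or normalization, and the names `causal_attention` and `forward` are made up for illustration. The point is only that position i at layer N reads layer N-1 at positions 0..i, so every position in a layer could be computed independently, and the whole state is a deterministic function of the tokens.

```python
import math

def causal_attention(xs):
    """Toy single-head causal self-attention (assumption: identity Q/K/V).

    xs holds the previous layer's vector for every token. Each loop
    iteration reads only xs, never another iteration's output, so the
    iterations are independent and could run in parallel.
    """
    d = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]  # current and all previous tokens only
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, visible)) / total
            for j in range(d)
        ])
    return out

def forward(tokens, n_layers=2):
    """Stack layers: layer N is computed from layer N-1 for all tokens."""
    xs = tokens
    for _ in range(n_layers):
        xs = causal_attention(xs)
    return xs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = forward(tokens)
# Appending a token never changes the state at earlier positions, and
# recomputing from the same tokens reproduces the same state.
prefix_state = forward(tokens[:2])
```

Contrast this with an RNN, where the state at token i in layer N needs layer N's state at token i-1, which forces the computation to walk the sequence one token at a time.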

So if you wished, you could implement a transformer by recomputing everything on every new token. That would be incredibly inefficient. However, if you're continuing a conversation with an LLM, you likely would recompute all the state for all tokens on each new user input, because the alternative is to keep all that state in memory until the user gets back to you a minute later. If you have too many simultaneous users you won't have enough VRAM for that. (In some cases moving it out of VRAM temporarily might be practical.)
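The trade-off between keeping the state and recomputing it can be sketched the same way. This is a toy under the same assumptions as any such sketch (single head, identity Q/K/V projections, no MLP), and `attend`, `recompute_all`, and `CachedDecoder` are hypothetical names, not a real library's API. The cached decoder keeps each layer's per-token inputs around and does work proportional to the context per new token; the recompute path keeps nothing and rebuilds everything from the tokens. Both give the same answer, which is what makes the state safe to throw away.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(d)
    ]

def recompute_all(tokens, n_layers=2):
    """Stateless path: rebuild every layer for every token, return the last."""
    xs = tokens
    for _ in range(n_layers):
        xs = [attend(xs[i], xs[: i + 1], xs[: i + 1]) for i in range(len(xs))]
    return xs[-1]

class CachedDecoder:
    """Stateful path: one cache per layer, growing by one entry per token."""

    def __init__(self, n_layers=2):
        self.cache = [[] for _ in range(n_layers)]

    def step(self, token):
        x = token
        for layer_cache in self.cache:
            layer_cache.append(x)  # this memory is what fills up VRAM
            x = attend(x, layer_cache, layer_cache)
        return x

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder = CachedDecoder()
cached_outputs = [decoder.step(t) for t in tokens]
recomputed_last = recompute_all(tokens)
```

The cache grows with both context length and layer count for every active conversation, which is why dropping it between user turns and paying for one recompute pass can be the cheaper choice.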