by visarga 808 days ago
It routes values based on linear combinations taken from the attention map.

But all of those values are created using an MLP with the same parameters, so there is no routing to different parameters.
You have to look at it as a sequence of time steps which can interact. You can implement this interaction in many ways, such as a transformer, Mamba, RWKV or MLP-Mixer. But the purpose is always to allow communication across time.

You use three distinct linear projections: one for queries, one for keys and one for values. From Q and K you compute the attention matrix A, and using A you construct linear combinations of the rows of V. Depending on A, the output for token i might take input from two other tokens, V_j and V_k, so information is moved between the tokens.
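
To make that concrete, here is a minimal single-head sketch in NumPy. The softmax and the 1/sqrt(d) scaling are my assumptions (the usual scaled dot-product form); the function name, shapes and random weights are made up for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (tokens, features) array.

    Assumes the usual scaled dot-product form with a row-wise softmax.
    """
    # Three distinct linear projections. The same weights are applied to
    # every token, so there is no routing to different parameters.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Attention matrix A: row i holds the weights token i puts on each token j.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    a = np.exp(scores)
    a = a / a.sum(axis=-1, keepdims=True)

    # Output row i is a linear combination of the rows of V, weighted by A[i].
    return a @ v, a, v

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, a, v = self_attention(x, w_q, w_k, w_v)
```

Each row of `a` sums to 1, and `out[i]` is just `sum_j a[i, j] * v[j]`. That weighted sum is the only place where tokens exchange information; everything else acts on each token independently.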