by whimsicalism 808 days ago
But all of those values are created using an MLP with the same parameters, so there is no routing to different parameters.
1 comment

You have to look at it as a sequence of time steps that can interact. You can implement this interaction in many ways, such as a Transformer, Mamba, RWKV, or MLP-Mixer, but the purpose is always to allow communication across time.

You use three distinct linear projections: one for queries, one for keys, and one for values. From Q and K you compute the attention matrix A, and with A you form linear combinations of the rows of V. Depending on A, the output for token i might take input from two other tokens, V_j and V_k, so information is moved between tokens even though the projection parameters are shared.
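
To make that concrete, here is a rough sketch of single-head self-attention in plain Python. The names (`attention`, `matmul`, `rand_matrix`), the tiny sizes, the random weights, and the scaling by sqrt(d) are my own choices for illustration, and masking and multiple heads are left out. The point is only that Wq, Wk and Wv are shared by every token, while A is computed from the input and decides which tokens read from which.

```python
import math
import random

def matmul(X, W):
    """Multiply an (n x d) matrix by a (d x m) matrix, both given as lists of lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, Wq, Wk, Wv):
    """Single-head self-attention, no masking (illustrative simplification)."""
    # The same three projections are applied to every token.
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(Q[0]))
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # scores[i][j] = Q_i . K_j
    # A[i][j] is how much token i reads from token j; it depends on the input.
    A = [softmax([s / scale for s in row]) for row in scores]
    # Each output row is a weighted sum of the rows of V.
    return matmul(A, V), A

def rand_matrix(rng, rows, cols):
    # Hypothetical small random weights, only for the demo.
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d = 3, 4
X = rand_matrix(rng, n_tokens, d)
Wq, Wk, Wv = (rand_matrix(rng, d, d) for _ in range(3))

out, A = attention(X, Wq, Wk, Wv)

# Change only token 2 and recompute with the same parameters.
X2 = [row[:] for row in X]
X2[2] = [v + 1.0 for v in X2[2]]
out2, _ = attention(X2, Wq, Wk, Wv)
```

Comparing `out[0]` with `out2[0]` shows the output for token 0 moving even though only token 2's input changed. Token 0's own value projection is identical in both runs, so the change comes entirely through A and the weighted sum over V.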