


	by whimsicalism 808 days ago
	How is attention basically routing?

2 comments

visarga 808 days ago

It routes values: each output is a linear combination of the values, with weights taken from the attention map.


whimsicalism 808 days ago

But all of those values are created using an MLP with the same parameters, so there is no routing to different parameters.


visarga 807 days ago

You have to look at it as a sequence of time steps which can interact. You can implement this interaction in many ways, such as transformer, mamba, rwkv or mlp-mixer. But the purpose is always to allow communication across time.

You use three distinct linear projections: one for queries, one for keys, and one for values. From Q and K you compute the attention matrix A, and using A you construct linear combinations of the rows of V. Depending on A, the output for token i might draw on the values of two other tokens, V_j and V_k, so information is moved between the tokens.
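
A rough sketch of that computation, assuming a single head, no masking, and toy sizes. The names `self_attention`, `Wq`, `Wk`, `Wv` and the random inputs are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) array with one row per token / time step.
    Q = X @ Wq  # queries
    K = X @ Wk  # keys
    V = X @ Wv  # values
    # A[i, j] is how much token i reads from token j; each row sums to 1.
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Row i of the output is a linear combination of the rows of V.
    return A @ V, A

rng = np.random.default_rng(0)
T, d = 4, 8  # toy sequence length and width (assumed)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

Row `A[i]` says where token i's output comes from: if it puts most of its weight on positions j and k, then `out[i]` is mostly a mix of `V[j]` and `V[k]`.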


pizza 808 days ago

Think of it like an edge flow matrix


whimsicalism 808 days ago

That doesn't clarify it for me. The same parameters are being used for every layer for every token. Yes, there is this differentiable lookup in attention like in MoE, but routing is about more than just differentiable lookup: it is about selecting on parameters, not state.
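
To make the "selecting on parameters" side concrete, here is a heavily simplified sketch of hard top-1 MoE routing. The function name `moe_top1`, the gate matrix `W_gate`, and the single-matrix "experts" are hypothetical simplifications; real MoE layers typically use full MLP experts, gate-weight scaling, and load balancing:

```python
import numpy as np

def moe_top1(X, W_gate, experts):
    # X: (T, d) tokens; W_gate: (d, E) gate; experts: list of E (d, d) matrices.
    # Each token picks exactly one expert, i.e. one set of parameters.
    choice = (X @ W_gate).argmax(axis=-1)
    # Token t is transformed only by the weights of its chosen expert.
    out = np.stack([X[t] @ experts[e] for t, e in enumerate(choice)])
    return out, choice

rng = np.random.default_rng(0)
T, d, E = 5, 4, 3  # toy sizes (assumed)
X = rng.normal(size=(T, d))
W_gate = rng.normal(size=(d, E))
experts = [rng.normal(size=(d, d)) for _ in range(E)]
out, choice = moe_top1(X, W_gate, experts)
```

Here different tokens can pass through different weight matrices, whereas in attention every token goes through the same projections and only the mixing weights over other tokens' values change.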
