| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nil-sec 1171 days ago

Feedforward: y=Wx

Attention: y=W(x)x

W is Matrix, x & y Are vectors. In the second case, W is a function of the input.

2 comments

Sai_ 1171 days ago

You must be from a planet with very long years!

There is no way I can even begin to digest what you have said in your comment.

link

nil-sec 1171 days ago

Sorry maybe I should have added more explanation. One way to think about attention, which is the main distinguishing element in a transformer, is as an adaptable matrix. A feedforward layer is a matrix with static entries that do not change at inference time (only during training). The attention mechanism offers a way to have adaptable weight matrices at inference time (this is implemented by using three different matrices, K,Q & V called keys query and value in case you want to dig deeper).

link

oneearedrabbit 1171 days ago

I think in your notation it should have been:

y=Wx_0

y=W(x)x_0

link

nil-sec 1171 days ago

I guess I was more thinking about self attention, so yes. The more general case is covered by your notation!

link