by agoose77 480 days ago

I am not an expert by _any_ means, but to provide _some_ intuition: self-attention is ultimately just a parameterised token mixer (see https://medium.com/optalysys/attention-fourier-transforms-a-...), i.e. each output vector is a blend of the corresponding input vector with all the other input vectors, where the blending weights are themselves computed from the inputs.
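To make the "token mixer" reading a bit more concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. The helper names (`softmax`, `matvec`, `self_attention`), the identity projection matrices, and the toy tokens are all assumptions of mine for illustration, not taken from the linked article.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    # Project every token into query, key and value vectors.
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))

    outputs, weights = [], []
    for q in queries:
        # Score this token against every token, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        mix = softmax(scores)  # data-dependent mixing weights, summing to 1
        weights.append(mix)
        # The output is a weighted average of all value vectors.
        dim = len(values[0])
        outputs.append([sum(m * v[d] for m, v in zip(mix, values)) for d in range(dim)])
    return outputs, weights

# Hypothetical toy setup: identity projections and three 2-d tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, mixing_weights = self_attention(tokens, identity, identity, identity)
```

Each row of `mixing_weights` says how much one output token draws from every input token, so changing any single input shifts every output. The learned part (the "parameterised" bit) lives in `w_q`, `w_k` and `w_v`.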

You can see conceptually how this is similar to a convolution with some simplification, e.g. https://openreview.net/pdf?id=8l5GjEqGiRG
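For comparison, a 1-d convolution can also be read as a token mixer, just with fixed weights that depend only on relative position rather than on the content of the tokens. A rough sketch, where `conv1d_mix` and the smoothing kernel are hypothetical names and values I picked for illustration (strictly this is cross-correlation, which is the same thing for a symmetric kernel):

```python
def conv1d_mix(tokens, kernel):
    # Mix scalar tokens using fixed weights indexed by relative offset.
    half = len(kernel) // 2
    out = []
    for i in range(len(tokens)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            # Positions outside the sequence are treated as zero padding.
            if 0 <= idx < len(tokens):
                acc += w * tokens[idx]
        out.append(acc)
    return out

signal = [1.0, 2.0, 3.0, 4.0]
smoothed = conv1d_mix(signal, [0.25, 0.5, 0.25])
print(smoothed)  # [1.0, 2.0, 3.0, 2.75]
```

The same kernel is applied at every position and only reaches a fixed neighbourhood, whereas attention recomputes its weights from the tokens and can reach the whole sequence.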

Convolutions are often used in contexts where you want to account for global state in some way (see https://openreview.net/pdf?id=8l5GjEqGiRG).