I am not an expert by _any_ means, but to provide _some_ intuition — self-attention is ultimately just a parameterised token mixer (see https://medium.com/optalysys/attention-fourier-transforms-a-...) i.e. each vector in the output depends upon the corresponding input vector transformed by some function of all the other input vectors.