Sorry. You do mention a linear systems response, and that's what I meant.
In that setting, the eigenvectors work as a generalized forward and inverse Fourier transform, and the eigenvalues form the transfer function you allude to in the bold sentence:
"The attention mechanism’s role is the same as that of a transfer function in a linear
time-invariant system, namely it calculates the frequency response of the transformer
model,"
Specifically, it seems to me that this requires a _symmetric_ attention matrix. You get that from the self-attention mechanisms (two of the three places where attention is used in transformers), but not from all of them: notably not from the one that combines the outputs of the first two attention mechanisms (one on the input side, one on the output side).
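To make the symmetry point concrete, here is a minimal NumPy sketch. It is not taken from the paper: the matrix `A` is just a random symmetric stand-in for an attention matrix, and all variable names are my own. It shows that for a symmetric matrix the eigenvectors act as an orthonormal forward/inverse transform and the eigenvalues act as a per-mode gain, and that the orthonormality is generally lost once the matrix is not symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical stand-in for an attention matrix (assumption, not from the paper).
M = rng.standard_normal((n, n))
A = (M + M.T) / 2  # symmetrize

# Symmetric case: real eigenvalues, orthonormal eigenvectors (columns of V).
eigvals, V = np.linalg.eigh(A)

x = rng.standard_normal(n)
x_hat = V.T @ x          # "forward transform": project onto the eigenbasis
y_hat = eigvals * x_hat  # "transfer function": scale each mode by its eigenvalue
y = V @ y_hat            # "inverse transform": back to the original basis

matches = np.allclose(y, A @ x)                 # same result as applying A directly
orthonormal = np.allclose(V.T @ V, np.eye(n))   # V^T is the inverse of V

# Non-symmetric case: eigenvectors are generally complex and not orthonormal,
# so the inverse transform is no longer just the conjugate transpose.
w, U = np.linalg.eig(M)
nonsym_orthonormal = np.allclose(U.conj().T @ U, np.eye(n))

print(matches, orthonormal, nonsym_orthonormal)
```

In the non-symmetric case you can still diagonalize (when the matrix is diagonalizable), but the inverse transform becomes `inv(U)` rather than the transpose, which is why the clean forward/inverse Fourier picture seems to me to lean on symmetry.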
I think that the magic comes from the antipode which makes things symmetric. The Ising model is somewhat similar.
Rereading the paper, quite a bit has changed in my understanding of all this. My conclusions still stand, but some of the reasoning needs to be explained better.