by cscheid
1151 days ago
Interesting paper. If you don't mind my asking, I have a question about the claim in 6.2 that attention matrices are SPD (symmetric positive definite). It seems to me that the empirical result that the eigenvalues are positive isn't enough, on its own, to get a Fourier Transform interpretation. Specifically, I don't understand the assumption that all attention matrices are symmetric. (I'm sure you know that positive eigenvalues alone don't imply SPD, but for other folks reading, [[1 1/2] [1/3 1]] is a simple concrete example: both eigenvalues are positive, yet the matrix is not symmetric.) Consider Fig. 17 here: https://lilianweng.github.io/posts/2018-06-24-attention/ (this is Fig. 1 in "Attention Is All You Need"). I understand that you get symmetric attention matrices for the self-attention in the input stream, as well as for the masked self-attention in the output stream (the first block). But I don't understand how you claim symmetry for the final attention mechanism, the one that combines input and output. And if you don't get symmetry, you don't get the Fourier Transform interpretation and all the nice algebra that follows.
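For anyone who wants to check the 2x2 example by hand, here is a minimal sketch in plain Python. The helper names (`eigenvalues_2x2`, `is_symmetric`, `eigenvector_2x2`) are mine, not from the paper, and the closed-form eigenvalue formula only covers real 2x2 matrices. It also prints the cosine of the angle between the two eigenvectors, which I am assuming is the relevant symptom here: without symmetry, nothing forces the eigenbasis to be orthogonal.

```python
import math

def eigenvalues_2x2(m):
    """Real eigenvalues of a 2x2 matrix via its characteristic polynomial."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4.0 * det
    if disc < 0:
        raise ValueError("matrix has complex eigenvalues")
    root = math.sqrt(disc)
    return (trace - root) / 2.0, (trace + root) / 2.0

def is_symmetric(m, tol=1e-12):
    """True if m equals its transpose up to tol."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

def eigenvector_2x2(m, lam):
    """An (unnormalized) eigenvector for eigenvalue lam; assumes m[0][1] != 0."""
    (a, b), _ = m
    return (b, lam - a)

# The example from the comment: positive eigenvalues, but not symmetric.
A = [[1.0, 1.0 / 2.0], [1.0 / 3.0, 1.0]]

lam_lo, lam_hi = eigenvalues_2x2(A)
v_lo = eigenvector_2x2(A, lam_lo)
v_hi = eigenvector_2x2(A, lam_hi)

# Cosine of the angle between the eigenvectors; 0 would mean orthogonal.
dot = v_lo[0] * v_hi[0] + v_lo[1] * v_hi[1]
cos_angle = dot / (math.hypot(*v_lo) * math.hypot(*v_hi))

print(lam_lo > 0 and lam_hi > 0)  # True: both eigenvalues are positive
print(is_symmetric(A))            # False: A is not symmetric, so not SPD
print(round(cos_angle, 6))        # 0.2: the eigenvectors are not orthogonal
```

The eigenvalues come out to 1 ± sqrt(1/6), roughly 0.59 and 1.41, so the positivity test passes while the symmetry test fails.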