by just_a_quack 1130 days ago
There's not. The positional encodings are generated using sines and cosines such that the encoding at any fixed offset can be written as a linear function of the encoding at the original position. Using the DFT here would not make sense, as the positional encodings are fixed anyway, and during inference this method generalizes nicely because the frequencies in the arguments of the positional encoding functions form a geometric progression.
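To make the "linear function" point concrete, here is a rough NumPy sketch. It assumes the interleaved sin/cos layout and base 10000 from the original Transformer paper; the helper names `sinusoidal_pe` and `offset_matrix` are made up for illustration. Each (sin, cos) pair is rotated by an angle that depends only on the offset k, not on the position, so one block-diagonal matrix maps PE(pos) to PE(pos + k) for every pos.

```python
import numpy as np

def frequencies(d_model=8, base=10000.0):
    # One frequency per (sin, cos) pair; they form a geometric progression.
    i = np.arange(d_model // 2)
    return base ** (-2.0 * i / d_model)

def sinusoidal_pe(pos, d_model=8, base=10000.0):
    # Interleaved layout: even indices hold sin, odd indices hold cos.
    angles = pos * frequencies(d_model, base)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def offset_matrix(k, d_model=8, base=10000.0):
    # Block-diagonal matrix of 2x2 rotations; depends on the offset k only.
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(frequencies(d_model, base)):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin(a+b) = sin(a)cos(b) + cos(a)sin(b)
        # cos(a+b) = cos(a)cos(b) - sin(a)sin(b)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

# The same matrix shifts any position by 5.
M5 = offset_matrix(5)
shift_ok = all(
    np.allclose(M5 @ sinusoidal_pe(p), sinusoidal_pe(p + 5)) for p in (0, 3, 40)
)
```

Note that `M5` is built without ever looking at the position, which is the whole point.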
1 comment

There isn't a DFT applied directly; the statement here is a more basic one. Every circulant matrix (here, the linear graph of words) has the same eigenvectors, the DFT basis vectors, and so is diagonalized by the DFT.
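A quick numerical check of the circulant claim, as a sketch only. It assumes the word graph wraps around into a ring so that its adjacency matrix is exactly circulant (an open chain is Toeplitz and only approximately circulant); `cycle_adjacency` and `dft_matrix` are illustrative names, not anything from the paper.

```python
import numpy as np

def cycle_adjacency(n):
    # Ring of n positions: each one is linked to its left and right neighbour.
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1.0
    A[idx, (idx - 1) % n] = 1.0
    return A

def dft_matrix(n):
    # Unitary DFT matrix; its rows are the complex sinusoids.
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

n = 8
A = cycle_adjacency(n)
F = dft_matrix(n)

# Change of basis into the Fourier modes.
D = F @ A @ F.conj().T
off_diag = D - np.diag(np.diag(D))
eigenvalues = np.diag(D).real
```

Here `off_diag` comes out numerically zero, and the diagonal holds the eigenvalues 2*cos(2*pi*k/n). The real and imaginary parts of the rows of `F` are the cosines and sines that a sinusoidal PE is built from.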

The PE in the original Vaswani et al. paper is based on this; they just didn't spell out all the details. So effectively the PE hints to the model that the input is a linear graph, because these are its eigenvectors.