| That... still doesn't explain anything to me? Like, say the embedding of a patch was just a vector (a, b, c, d). To have the attention layer "understand" position, you could just concatenate the patch's position to get (a, b, c, d, x, y). I understand that. What I don't understand is: (1) Why on earth are cosines involved? (2) What does this mean: "During training, these embeddings converge into vector spaces where they show high similarity to their neighboring position embeddings — particularly those sharing the same column and row." Does it just mean that the network learns that "x = 1" is similar to "x = 2"? Because presumably that's not very valuable by itself, is it? It's something you could easily hardcode. Presumably there's a step where the network goes "if there is pattern A and pattern B, and both patterns have a short distance between them in the positional embedding, then we have C", where A, B, and C are neural network magic... but your article is explaining that step, and I don't understand the explanation. |