by jamesbriggs
1299 days ago
Hi, author of the article here. The position embeddings act as a "position signal" that modifies the patch embedding. Each learned signal ends up similar to the signals of neighbouring positions, and the later layers of the model use the "similarity" between signals to identify the proximity/order of different patches. There is no explicit mechanism that tells the network to make neighbouring position embeddings similar; it's just a result of training that fortunately works and seems logical. I can definitely try to explain further if needed.
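To make the "position signal" idea concrete, here is a rough sketch in plain Python. It assumes the standard ViT setup, where the position embedding is added element-wise to the patch embedding rather than concatenated. The 3x3 grid, `toy_position_embedding`, and `add_position` are hypothetical names made up for illustration, and the vectors are hand-built stand-ins (one-hot row plus one-hot column), not weights from a trained model. They only mimic the pattern described above, so treat this as a toy, not as what the network actually learns.

```python
import math

GRID = 3  # hypothetical 3x3 grid of patches


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_position_embedding(row, col):
    # ASSUMPTION: hand-built stand-in for a learned vector.
    # First GRID slots encode the row, last GRID slots encode the column.
    vec = [0.0] * (2 * GRID)
    vec[row] = 1.0
    vec[GRID + col] = 1.0
    return vec


def add_position(patch_embedding, position_embedding):
    # The position signal is summed into the patch embedding,
    # so the vector keeps the same dimensionality.
    return [p + e for p, e in zip(patch_embedding, position_embedding)]


# Similarity of the centre position (1, 1) to every position in the grid.
centre = toy_position_embedding(1, 1)
similarity_map = [
    [cosine(centre, toy_position_embedding(r, c)) for c in range(GRID)]
    for r in range(GRID)
]

for row in similarity_map:
    print(["%.1f" % s for s in row])
```

With these toy vectors, the centre position is most similar to itself, partially similar to positions in its own row or column, and dissimilar to the rest. Cosine similarity shows up here only as a way of comparing two position vectors; the model itself is never told to compute it.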
Like, say the embedding of a patch was just a vector (a, b, c, d). To have the attention layer "understand" position, you could just concatenate the patch's position to get (a, b, c, d, x, y). I understand that.
What I don't understand is:
- Why on earth are cosines involved?
- What does this mean:
> During training, these embeddings converge into vector spaces where they show high similarity to their neighboring position embeddings — particularly those sharing the same column and row
Does it just mean that the network learns that "x = 1" is similar to "x = 2"? Because presumably, that's not very valuable by itself, is it? It's something you could easily hardcode.
Presumably there's a step where the network goes "if there is pattern A and pattern B, and both patterns have a short distance between them in the positional embedding, then we have C", where A, B and C are neural network magic... but your article is explaining that step, and I don't understand the explanation.