| HN Mirror


by PoignardAzur 1298 days ago

That... still doesn't explain anything to me?

Like, say the embedding of a patch was just a vector (a, b, c, d). To have the attention layer "understand" position, you could just concatenate the patch's position to get (a, b, c, d, x, y). I understand that.
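To be concrete about the concatenation I mean, here is a minimal sketch. The function name `concat_position` and the numbers are made up for illustration, and this is the naive approach I'm describing, not what the article does:

```python
def concat_position(patch_embedding, x, y):
    """Append the patch's grid coordinates to its embedding vector."""
    return list(patch_embedding) + [float(x), float(y)]

# Stand-in for (a, b, c, d); the values are arbitrary.
patch = [0.1, 0.2, 0.3, 0.4]

# The patch sits at column 2, row 5 of the image grid (also arbitrary).
with_pos = concat_position(patch, x=2, y=5)
```

The result is a 6-element vector whose last two entries are the raw coordinates, so any later layer can read the position directly.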

What I don't understand is:

- Why on earth are cosines involved?

- What does that mean:

> During training, these embeddings converge into vector spaces where they show high similarity to their neighboring position embeddings — particularly those sharing the same column and row

Does it just mean that the network learns that "x = 1" is similar to "x = 2"? Because presumably, that's not very valuable by itself, is it? It's something you could easily hardcode.
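Here is a rough sketch of the kind of hardcoding I have in mind. The Gaussian-bump encoding and the names `bump_embedding` and `cosine_similarity` are my own invention for illustration, not anything from the article:

```python
import math

def bump_embedding(x, size=8, width=1.0):
    # Hypothetical hand-made embedding: a Gaussian bump centred on index x.
    return [math.exp(-((i - x) ** 2) / (2 * width ** 2)) for i in range(size)]

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Adjacent positions overlap a lot; distant ones barely overlap.
near = cosine_similarity(bump_embedding(1), bump_embedding(2))
far = cosine_similarity(bump_embedding(1), bump_embedding(6))
```

With this encoding, `near` comes out larger than `far` without any training, which is why "neighbors are similar" doesn't strike me as something the network needs to learn.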

Presumably there's a step where the network goes "if there is pattern A and pattern B, and both patterns have a short distance between them in the positional embedding, then we have C", where A, B, and C are neural network magic... but if your article is explaining that step, I don't understand the explanation.