by jamesbriggs 1299 days ago
Hi, author of the article here.

The positional embeddings act as a "position signal" that modifies the patch embedding. Each learned signal ends up similar to the signals of neighbouring positions, and the later layers of the model use that "similarity" between signals to identify the proximity/order of different patches.

There is no explicit mechanism that tells the network to make neighbouring position embeddings similar; it's just a result of the training that fortunately works and seems logical.

I can definitely try to explain further if needed.

4 comments

That... still doesn't explain anything to me?

Like, say the embedding of a patch was just a vector (a, b, c, d). To have the attention layer "understand" position, you could just concatenate the patch's position to get (a, b, c, d, x, y). I understand that.
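To make the two options concrete, here is a minimal sketch in plain Python. It assumes (based on the standard ViT design, not on anything confirmed in this thread) that the position vector is added element-wise to the patch embedding instead of being concatenated. All names and numbers are invented for illustration.

```python
# Hypothetical 4-dimensional patch embedding (a, b, c, d) for one patch.
patch = [0.5, -1.0, 0.25, 2.0]

# Option 1: concatenate the raw grid coordinates (x, y) onto the embedding.
x, y = 1, 2
concatenated = patch + [float(x), float(y)]  # length grows from 4 to 6

# Option 2: add a position vector of the same length as the embedding.
# In a real model these values would be learned; here they are made up.
position_vector = [0.1, 0.0, -0.2, 0.3]
added = [p + q for p, q in zip(patch, position_vector)]  # length stays 4
```

With addition the embedding size does not change, so the position information is mixed into the same dimensions as the patch content rather than living in its own slots.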

What I don't understand is:

- Why on earth are cosines involved?

- What does that mean:

> During training, these embeddings converge into vector spaces where they show high similarity to their neighboring position embeddings — particularly those sharing the same column and row

Does it just mean that the network learns that "x = 1" is similar to "x = 2"? Because presumably, that's not very valuable by itself, is it? It's something you could easily hardcode.
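My best guess at where the cosines come from is cosine similarity, i.e. a way of measuring how closely two position vectors point in the same direction. A small sketch of that reading, with three invented position vectors (the real ones would be learned and much longer):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical position vectors for columns x=1, x=2 and x=5.
pos_x1 = [1.0, 0.2, 0.0]
pos_x2 = [0.9, 0.4, 0.1]
pos_x5 = [-0.3, 1.0, 0.8]

near = cosine_similarity(pos_x1, pos_x2)  # neighbouring columns
far = cosine_similarity(pos_x1, pos_x5)   # distant columns
```

Under this reading, "high similarity to neighbouring position embeddings" would mean that `near` comes out larger than `far` for the learned vectors.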

Presumably there's a step where the network goes "if there is pattern A and pattern B and both patterns have a short distance between them in the positional embedding then we have C", where A, B and C are neural network magic... but if your article is explaining that step, I don't understand the explanation.

I think it would help here to avoid "big" machine learning words such as "patch embedding".

This is one of the issues with a lot of machine learning articles out there (not to pick on you, sorry); there are almost always easy, illustrative examples that you can use to break this down into a simpler explanation.

Hi! Are these positional embeddings literally made by concatenating the patch embedding with a number, then passing that through the next layer, as suggested by the figure under "Images to Patch Embeddings"?

It's the most confusing part of transformers for me. How do we train the module that creates these embeddings?
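As far as I understand the standard ViT setup, there is no separate module to train: the position embeddings are a table of ordinary parameters, one vector per patch position, updated by the same gradient descent as the rest of the network. Here is a toy sketch of that idea, where a squared-error loss stands in for the real model and loss, and every name is hypothetical:

```python
# A table with one trainable vector per patch position (toy sizes).
num_positions, dim = 4, 3
position_table = [[0.0] * dim for _ in range(num_positions)]

def forward(patch_embedding, position_index):
    """Add the position vector for this slot to the patch embedding."""
    pos = position_table[position_index]
    return [p + q for p, q in zip(patch_embedding, pos)]

def sgd_step(patch_embedding, position_index, target, lr=0.1):
    """One gradient step on a squared-error loss; returns the loss before the update."""
    output = forward(patch_embedding, position_index)
    # The gradient of sum((output - target)^2) w.r.t. the position vector
    # is 2 * (output - target), because the vector is simply added in.
    grad = [2.0 * (o - t) for o, t in zip(output, target)]
    row = position_table[position_index]
    for i in range(dim):
        row[i] -= lr * grad[i]
    return sum((o - t) ** 2 for o, t in zip(output, target))

patch = [1.0, 0.0, -1.0]
target = [1.5, 0.5, -1.0]
losses = [sgd_step(patch, 2, target) for _ in range(20)]
```

Only the row for position 2 changes here, because only that position appeared in the "training data"; the loss shrinks as that row moves toward whatever offset reduces the error.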

What’s the advantage of learning this positional embedding vs hardcoding one that has the desired properties of location similarity?
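For reference, one well-known hardcoded option is the fixed sinusoidal encoding from the original Transformer paper. The sketch below is a 1-D version with a tiny dimension, chosen only to show that nearby positions can get similar vectors without any training; it is not taken from the article and says nothing about which approach works better.

```python
import math

def sinusoidal_encoding(position, dim=8):
    """Fixed encoding: sin/cos pairs at geometrically spaced frequencies."""
    vector = []
    for i in range(0, dim, 2):
        frequency = 1.0 / (10000 ** (i / dim))
        vector.append(math.sin(position * frequency))
        vector.append(math.cos(position * frequency))
    return vector

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

p0 = sinusoidal_encoding(0)
p1 = sinusoidal_encoding(1)
p3 = sinusoidal_encoding(3)

sim_adjacent = cosine_similarity(p0, p1)  # positions 0 and 1
sim_distant = cosine_similarity(p0, p3)   # positions 0 and 3
```

In this small example the adjacent pair is more similar than the distant pair, which is the "location similarity" property the question refers to.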