by Buttons840 1155 days ago
> Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).

Yes, the entire description is helpful, but I especially appreciate this validation that concatenating the position encoding is a valid option.

I've been thinking a lot about aggregation functions, usually summation since it's the most basic aggregation function. After adding the token embedding and the positional encoding together, it seems information has been lost, because the resulting sum cannot be separated back into the original values. And yet, that seems to be what they do in most transformers, so it must be worth the trade-off.
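A tiny sketch of the concern, with made-up 2-dimensional vectors (all names and values here are hypothetical, chosen only for illustration): two different (token, position) pairs can add up to the same vector, while concatenation keeps them distinct.

```python
import numpy as np

# Hypothetical toy vectors, not taken from any real model.
tok_a, pos_a = np.array([1.0, 2.0]), np.array([0.5, -1.0])
tok_b, pos_b = np.array([1.5, 0.0]), np.array([0.0, 1.0])

# Summation: both pairs collapse to the same vector.
sum_a = tok_a + pos_a
sum_b = tok_b + pos_b
print(np.array_equal(sum_a, sum_b))

# Concatenation: the two pairs stay distinguishable.
concat_a = np.concatenate([tok_a, pos_a])
concat_b = np.concatenate([tok_b, pos_b])
print(np.array_equal(concat_a, concat_b))
```

This only shows that a collision is possible in principle. Whether collisions actually happen depends on which vectors the model ends up learning.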

It reminds me of being a kid, when you first realize that zipping a file produces a smaller file and you think "well, what if I zip the zip file?" At first you wonder if you can eventually compress everything down to a single byte. I wonder the same with aggregation / summation, "if I can add the position to the embedding, and things still work, can I just keep adding things together until I have a single number?" Obviously there are some limits, but I'm not sure where those are. Maybe nobody knows? I'm hoping to study linear algebra more and perhaps I will find some answers there?

2 comments

One thing to bear in mind is that these embedding vectors are high dimensional, so that it is entirely possible that the token embedding and position embedding are near-orthogonal to one another. As a result, information isn't necessarily lost.
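A rough way to see the near-orthogonality point, assuming random Gaussian vectors as a stand-in for embeddings (learned embeddings are not random, so treat this as intuition only; `mean_abs_cosine` is a hypothetical helper):

```python
import numpy as np

def mean_abs_cosine(dim, n_pairs=1000, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.mean(np.abs(cos)))

# The average |cosine| shrinks as the dimension grows.
for dim in (2, 16, 512):
    print(dim, round(mean_abs_cosine(dim), 3))
```

With a few hundred dimensions, two unrelated directions have a cosine close to zero, so a sum of two such vectors overlaps very little.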
The information might be formally lost for the given token, but remember that transformers train on huge amounts of data.

The (absolute) positional encoding is an arbitrary but fixed bias (a push in some direction). The word "cat" at position 2 is pushed in the 2-direction. This "cat" might be different from a "cat" at position 3, so the model can learn about this distinction.

Nevertheless, the model could still learn to keep "cats" at all positions together, for instance so that "cats" at any position are more similar to other "cats" than to "dogs". More importantly, for some words, the model might learn that a word at the beginning of the sequence should have an entirely different meaning than the same word at the end of the sequence.

In other words, since the embeddings are free parameters to be learned (usually both as embeddings, and weight-tied in the head), there isn't any loss in flexibility. Rather, the model can learn how much mixing is required, or whether the information added by the positional embedding should be separable (for instance by making the embeddings linearly independent otherwise).
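One way to picture the "no loss in flexibility" argument, as a sketch with hypothetical toy sizes and values: if token vectors only use the first few dimensions and position vectors only use the rest, adding them gives exactly the concatenation. So a model that adds could, in principle, learn to behave like one that concatenates.

```python
import numpy as np

# Hypothetical toy sizes and values, for illustration only.
d_tok, d_pos = 3, 2
tok = np.array([0.2, -1.0, 0.7])
pos = np.array([1.0, -0.5])

# Place each vector in its own block of a (d_tok + d_pos)-dimensional space.
tok_padded = np.concatenate([tok, np.zeros(d_pos)])
pos_padded = np.concatenate([np.zeros(d_tok), pos])

summed = tok_padded + pos_padded
concatenated = np.concatenate([tok, pos])
print(np.array_equal(summed, concatenated))

# Because the blocks do not overlap, both parts can be read back from the sum.
print(np.array_equal(summed[:d_tok], tok), np.array_equal(summed[d_tok:], pos))
```

Nothing forces a trained model to pick disjoint blocks like this. The point is only that the additive setup includes this arrangement as a special case.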

If you concatenate, you carry along otherwise useless, static dimensions, and mixing them into the embeddings would be the very first thing the model learns in layer 1.
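The "first thing the model learns" remark can be made concrete with a small sketch, assuming the first operation is a plain linear map `W` (shapes and names are hypothetical): applying `W` to the concatenation is the same as projecting the token and position parts separately and adding the results.

```python
import numpy as np

rng = np.random.default_rng(0)
d_tok, d_pos, d_out = 8, 4, 8  # hypothetical sizes

tok = rng.standard_normal(d_tok)
pos = rng.standard_normal(d_pos)
W = rng.standard_normal((d_out, d_tok + d_pos))  # stand-in for a first linear layer

# Linear layer applied to the concatenated input.
concat_out = W @ np.concatenate([tok, pos])

# Same layer, split into the columns that see the token and the position parts.
W_tok, W_pos = W[:, :d_tok], W[:, d_tok:]
sum_out = W_tok @ tok + W_pos @ pos

print(np.allclose(concat_out, sum_out))
```

So after one linear layer, a concatenated input has already become a sum of a projected token vector and a projected position vector.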