by Buttons840
1155 days ago
> Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).

Yes, the entire description is helpful, but I especially appreciate this validation that concatenating the position encoding is a valid option.

I've been thinking a lot about aggregation functions, usually summation since it's the most basic aggregation function. After adding the token embedding and the positional encoding together, it seems information has been lost, because the resulting sum cannot be separated back into the original values. And yet, that seems to be what they do in most transformers, so it must be worth the trade-off.

It reminds me of being a kid, when you first realize that zipping a file produces a smaller file and you think "well, what if I zip the zip file?" At first you wonder if you can eventually compress everything down to a single byte. I wonder the same with aggregation / summation: "if I can add the position to the embedding, and things still work, can I just keep adding things together until I have a single number?"

Obviously there are some limits, but I'm not sure where those are. Maybe nobody knows? I'm hoping to study linear algebra more and perhaps I will find some answers there?
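To make the sum-versus-concatenate trade-off concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the token vector is made up, the helper names are hypothetical, and the sinusoidal formula is just one common choice of positional encoding. It only shows that concatenation can be split back apart by slicing, while a bare sum can be produced by more than one pair of inputs.

```python
import math

def sinusoidal_position(pos, dim):
    """One common positional encoding (assumed here); dim must be even."""
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def combine_by_sum(token_vec, pos_vec):
    # Element-wise addition: output keeps the same width as the inputs.
    return [t + p for t, p in zip(token_vec, pos_vec)]

def combine_by_concat(token_vec, pos_vec):
    # Concatenation: output width is the sum of both widths.
    return token_vec + pos_vec

token = [0.5, -1.0, 0.25, 2.0]     # made-up token embedding
pos = sinusoidal_position(3, 4)    # encoding for position 3

summed = combine_by_sum(token, pos)
concatenated = combine_by_concat(token, pos)

# Concatenation is trivially reversible: slice the two halves back out.
recovered_token = concatenated[:4]
recovered_pos = concatenated[4:]

# A sum is not reversible on its own: shift one input up and the
# other down, and the result is (up to rounding) the same vector.
other_token = [t + 1.0 for t in token]
other_pos = [p - 1.0 for p in pos]
same_sum = combine_by_sum(other_token, other_pos)

print(len(summed), len(concatenated))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(summed, same_sum)))
```

Note that `other_pos` is not itself a valid sinusoidal encoding, so this says nothing about whether a trained model can still tell token and position apart in practice. It only illustrates the general point that addition, taken alone, does not preserve its inputs, whereas concatenation does at the cost of a wider vector.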