|
|
|
|
|
by flebron
581 days ago
|
|
All of them are vectors of embedded representations of tokens. In a transformer, you want to compute the inner product between a query (the token who is doing the attending) and the key (the token who is being attended to). An inductive bias we have is that the neural network's performance will be better if this inner product depends on the relative distance between the query token's position, and the key token's position. We thus encode each one with positional information, in such a way that (for RoPE at least) the inner product depends only on the distance between these tokens, and not their absolute positions in the input sentence. |
|