by jaidhyani
1161 days ago
The way the article presents this is misleading. The attention mechanism builds a new vector as a linear combination of other vectors, but after the first layer these have all been altered by passing through a transformer layer, so it makes less sense to talk about "other tokens" in most cases (and it becomes increasingly inaccurate the deeper into the model you go). It's also not really moving closer so much as adding, and what it's adding isn't the embedding-derived vector itself but a transform of that vector after it has been projected into a lower-dimensional space for that attention head. It would be more accurate to say that it's integrating information stored in other vectors that were derived from token embeddings at some point (which can also entail erasing information).
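To make the "adding, not moving closer" point concrete, here is a rough NumPy sketch of a single attention head. All sizes and weights are made up for illustration (random matrices, no masking, no layer norm, one head only), so treat it as a toy under those assumptions rather than how any particular model is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 positions, model width 8, head width 2.
seq_len, d_model, d_head = 4, 8, 2

# Vectors entering some layer. Past the first layer these are no longer
# plain token embeddings; earlier layers have already rewritten them.
x = rng.normal(size=(seq_len, d_model))

# Hypothetical random weights for one head (illustration only).
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))  # projects DOWN to the head's space
W_o = rng.normal(size=(d_head, d_model))  # projects back up to model width


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


q, k, v = x @ W_q, x @ W_k, x @ W_v

# Each row of `weights` is a set of mixing coefficients over positions.
weights = softmax(q @ k.T / np.sqrt(d_head))

# Linear combination of the low-dimensional value vectors, not of x itself.
head_out = weights @ v

# The head's contribution is ADDED to the existing vector.
update = head_out @ W_o
x_new = x + update
```

Note that `update` can point in any direction relative to `x`, including one that cancels part of it, which is the sense in which the head can erase information as well as add it.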
|
E.g. the vector for "bank" is midway between the geographical and financial meanings; "bank + money" is closer to the financial meaning, while "bank + river" is further away.
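A toy version of that picture, with hand-picked 2-D vectors that are purely hypothetical (real embeddings are high-dimensional and learned, and nothing here comes from an actual model):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def add(a, b):
    """Element-wise vector sum."""
    return [p + q for p, q in zip(a, b)]


# Made-up axes: first coordinate = "financial", second = "geographical".
financial = [1.0, 0.0]
bank = [0.5, 0.5]    # ambiguous: halfway between the two senses
money = [0.9, 0.1]   # hypothetical context vector
river = [0.1, 0.9]   # hypothetical context vector

sim_bank = cosine(bank, financial)
sim_bank_money = cosine(add(bank, money), financial)
sim_bank_river = cosine(add(bank, river), financial)

print(sim_bank_river, sim_bank, sim_bank_money)
```

With these made-up numbers, adding `money` raises the similarity to the financial direction and adding `river` lowers it.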