by binarymax
841 days ago
Attention in transformers works because, over the course of training, the model learns token importance based on frequency and context. If you don’t have attention and need a fast substitute for “forgetting” unimportant tokens, then BM25 is an intuitive hypothesis.
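To make the BM25 idea concrete, here is a rough sketch of how it scores tokens. The tiny corpus, the `bm25_weights` helper, and the `k1`/`b` defaults are all made up for illustration. The point is only that a token appearing in every document ends up with a much lower weight than a rare one, and that the weight comes from corpus statistics rather than from the token's neighbors.

```python
import math
from collections import Counter

def bm25_weights(doc, corpus, k1=1.5, b=0.75):
    """Return a BM25-style weight for each distinct token in `doc`.

    `doc` is a list of tokens and `corpus` is a list of such lists.
    k1 and b are the usual BM25 free parameters (values here are common
    defaults, not tuned for anything).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for d in corpus for tok in set(d))
    term_freq = Counter(doc)

    weights = {}
    for tok, tf in term_freq.items():
        df = doc_freq[tok]
        # Smoothed IDF (the "+ 1" variant keeps the value non-negative).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Saturating term-frequency component with length normalization.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        weights[tok] = idf * norm
    return weights

# Hypothetical toy corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = bm25_weights(corpus[0], corpus)
for tok, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{tok:>4}  {w:.3f}")
```

In this toy run "the" lands at the bottom of the list even though it occurs twice in the document, which is the "forgetting" behavior described above.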
Attention lets the weight of each token differ from one sequence to the next: the same token gets a different weight depending on which ‘neighbors’ it ends up with in the sequence.
This property lets models solve a variety of natural language problems, and the model ‘uses’ it to express context-aware dependencies.
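A small sketch of what "same token, different weight" means in practice. This is plain scaled dot-product attention in pure Python, with hand-picked 2-D embeddings and no learned projections (the query and key are just the raw embedding), so treat the numbers as illustrative only. `EMBED` and `attention_row` are names invented for this example.

```python
import math

# Hypothetical toy embeddings, chosen by hand for illustration.
EMBED = {
    "bank":  [1.0, 1.0],
    "river": [2.0, 0.0],
    "money": [0.0, 2.0],
    "the":   [0.1, 0.1],
}

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row(tokens, query_index):
    """Attention weights that the token at `query_index` assigns to every
    token in the sequence, using scaled dot-product scores."""
    q = EMBED[tokens[query_index]]
    scale = math.sqrt(len(q))
    scores = [
        sum(qi * ki for qi, ki in zip(q, EMBED[t])) / scale
        for t in tokens
    ]
    return softmax(scores)

seq_a = ["the", "bank", "river"]
seq_b = ["the", "bank", "the"]

# Weight that the last token assigns to "bank" (position 1) in each sequence.
bank_weight_a = attention_row(seq_a, query_index=2)[1]
bank_weight_b = attention_row(seq_b, query_index=2)[1]

print(f"weight on 'bank' in {seq_a}: {bank_weight_a:.3f}")
print(f"weight on 'bank' in {seq_b}: {bank_weight_b:.3f}")
```

The embedding for "bank" is identical in both sequences, yet the weight it receives changes because the softmax normalizes over whatever else is in the sequence. A BM25-style score has no such dependence on neighbors.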