by halflings
840 days ago
The importance (i.e. the attention weight) needs to be dynamic: one token will be important to some other tokens but not to others. tf-idf and similar heuristics are what we were using before attention came along, e.g. a tf-idf-weighted bag-of-words representation of word2vec embeddings. That approach fails in so many cases, because each token gets a single weight regardless of context.

If you don’t have attention and need a fast substitute for “forgetting” unimportant tokens, then BM25 is an intuitive hypothesis.
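
To make the static vs. dynamic distinction concrete, here is a rough sketch. The corpus, the 2-d "embeddings", and the helper names are all made up for illustration, and the attention part uses raw embeddings as queries and keys with no learned projections, so treat it as a toy rather than how any real model computes this.

```python
import math

# Toy corpus and hypothetical 2-d "embeddings" (invented for illustration).
corpus = [
    ["the", "bank", "of", "the", "river"],
    ["the", "bank", "approved", "the", "loan"],
    ["the", "river", "flooded"],
]
emb = {
    "the": [0.1, 0.1], "of": [0.1, 0.0], "bank": [0.7, 0.7],
    "river": [1.0, 0.0], "loan": [0.0, 1.0],
    "approved": [0.1, 0.8], "flooded": [0.9, 0.1],
}

def bm25_idf(term, docs):
    # Common BM25 idf variant with +1 inside the log to keep it positive.
    n = sum(term in d for d in docs)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)

def bm25_weights(doc, docs, k1=1.2, b=0.75):
    # One weight per term in the document, independent of who is "asking".
    avgdl = sum(len(d) for d in docs) / len(docs)
    weights = {}
    for term in set(doc):
        tf = doc.count(term)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        weights[term] = bm25_idf(term, docs) * norm
    return weights

def attention_weights(query, doc):
    # Softmax over dot products: the weights depend on the query token.
    scores = [sum(q * k for q, k in zip(emb[query], emb[t])) for t in doc]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

doc = corpus[0]
static = bm25_weights(doc, corpus)          # same no matter the query
att_river = attention_weights("river", doc)  # weights as seen from "river"
att_loan = attention_weights("loan", doc)    # weights as seen from "loan"
```

The BM25 weights downweight a frequent token like "the" once and for all, which is the fast "forgetting" part. The attention weights over the same document change depending on whether the query is "river" or "loan", which is the dynamic part a static heuristic cannot express.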