|
|
|
|
|
by binarymax
843 days ago
|
|
I hadn’t heard of Mamba before reading this article, and I was wondering if anyone has tried setting importance of a token as a TF-IDF or BM25 lookup. Requires a first pass to construct the token index but otherwise it seems like it would address the big issue that all these architectures have - they don’t know how “important” a token is. Interestingly this seems to be the crux of Mamba - deciding what tokens to forget! EMA other treats all tokens equally at sequence time. What if the tokens were weighted beforehand and the weights were passed as an attention mechanism? I wonder if anyone has tried something like this. |
|
tf-idf and similar heuristics are what we were using before attention came along, e.g. tf-idf weighted bag-of-words representation of word2vec embeddings. That approaches fails in so many cases.