| HN Mirror


	by binarymax 843 days ago
	I hadn’t heard of Mamba before reading this article, and I was wondering if anyone has tried setting the importance of a token via a TF-IDF or BM25 lookup. It requires a first pass to construct the token index, but otherwise it seems like it would address the big issue that all these architectures have - they don’t know how “important” a token is. Interestingly, this seems to be the crux of Mamba - deciding which tokens to forget! EMA otherwise treats all tokens equally at sequence time. What if the tokens were weighted beforehand and the weights were passed in as an attention mechanism? I wonder if anyone has tried something like this.
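
One possible reading of this idea, as a minimal sketch and not anything taken from Mamba itself: a first pass builds an IDF table (the IDF term of BM25 is used here), and the EMA update rate for each step is then scaled by the IDF of the incoming token. The function names, the scalar "values", and the `max_alpha` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_table(corpus):
    """First pass: BM25-style IDF per token, ln((N - df + 0.5) / (df + 0.5) + 1)."""
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((n_docs - d + 0.5) / (d + 0.5) + 1.0)
            for tok, d in df.items()}

def plain_ema(values, alpha=0.5):
    """Standard EMA: every token updates the state at the same rate."""
    state = 0.0
    for v in values:
        state = (1 - alpha) * state + alpha * v
    return state

def idf_gated_ema(tokens, values, idf, max_alpha=0.9):
    """Hypothetical variant: rare tokens overwrite the state, common ones barely touch it."""
    top = max(idf.values())
    state = 0.0
    for tok, v in zip(tokens, values):
        # Tokens missing from the index are treated as maximally rare (an assumption).
        alpha = max_alpha * idf.get(tok, top) / top
        state = (1 - alpha) * state + alpha * v
    return state

corpus = [["the", "cat"], ["the", "dog"], ["the", "mamba"]]
idf = idf_table(corpus)
tokens = ["mamba", "the", "the"]
values = [1.0, 0.0, 0.0]  # toy scalar "content" carried by each token
print(plain_ema(values), idf_gated_ema(tokens, values, idf))
```

In this toy run the gated version retains more of the rare token "mamba" after two occurrences of "the". Note that the gate depends only on token identity, not on the surrounding sequence.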

2 comments

halflings 843 days ago

The importance (e.g. attention) needs to be dynamic, e.g. one token will be important to some other tokens but not others.

tf-idf and similar heuristics are what we were using before attention came along, e.g. a tf-idf weighted bag-of-words representation of word2vec embeddings. That approach fails in so many cases.
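
For readers who have not seen it, a rough sketch of that older approach: average the word vectors of a document, weighting each by tf-idf. The 2-d vectors below are made-up stand-ins for word2vec embeddings, and the IDF formula is one common smoothed variant chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-d vectors standing in for real word2vec embeddings.
EMBEDDINGS = {
    "the": [0.1, 0.1],
    "bank": [0.9, 0.2],
    "river": [0.2, 0.9],
    "loan": [0.8, 0.1],
}

def idf_weights(corpus):
    """IDF per token: ln(N / df) + 1 (one common variant, assumed here)."""
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log(n_docs / d) + 1.0 for tok, d in df.items()}

def doc_vector(doc, idf, embeddings):
    """tf-idf weighted average of the word vectors in `doc`."""
    dim = len(next(iter(embeddings.values())))
    total, norm = [0.0] * dim, 0.0
    for tok, count in Counter(doc).items():
        if tok not in embeddings:
            continue  # out-of-vocabulary tokens are skipped
        w = count * idf.get(tok, 1.0)
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
        norm += w
    return [x / norm for x in total] if norm else total

corpus = [["the", "river", "bank"], ["the", "bank", "loan"], ["the", "loan"]]
idf = idf_weights(corpus)
a = doc_vector(["the", "river", "bank"], idf, EMBEDDINGS)
b = doc_vector(["bank", "the", "river"], idf, EMBEDDINGS)
print(a, b)
```

The two vectors come out the same (up to float rounding) because word order is discarded, and "bank" contributes with the same weight whether it sits next to "river" or "loan". Both are examples of the kind of failure described above.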

binarymax 843 days ago

Attention in transformers works because over time the model learns token importance based on frequency and context.

If you don’t have attention and need a fast substitute for “forgetting” unimportant tokens, then BM25 is an intuitive hypothesis.

curious_cat_163 843 days ago

To use your metaphor, TF-IDF will result in ‘fixed’ weights.

Attention makes it so that the weights of each token can be different in each sequence of tokens. Same token gets different weights depending on who its ‘neighbors’ in the sequence end up being.

This property allows the models to solve a variety of natural language problems and gets ‘used’ by the model to express context-aware dependencies.
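
A small sketch of the difference, under simplifying assumptions: scaled dot-product attention computed directly on made-up 2-d token vectors, with no learned query/key projections (real transformers learn those). The `VEC` table and the `weight_on` helper are hypothetical names used only for this illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(vectors):
    """Row i holds the weights token i assigns to every token in the sequence."""
    d = len(vectors[0])
    return [
        softmax([sum(q * k for q, k in zip(qv, kv)) / math.sqrt(d) for kv in vectors])
        for qv in vectors
    ]

# Hypothetical token vectors, not taken from any trained model.
VEC = {
    "river": [0.2, 0.9],
    "water": [0.1, 1.0],
    "bank": [0.9, 0.2],
    "loan": [0.8, 0.1],
}

def weight_on(token, sequence):
    """Weight that the first token of `sequence` assigns to `token`."""
    rows = attention_weights([VEC[t] for t in sequence])
    return rows[0][sequence.index(token)]

in_finance = weight_on("bank", ["loan", "bank", "river"])
in_nature = weight_on("bank", ["river", "bank", "water"])
print(in_finance, in_nature)
```

The same token "bank" receives a different weight in each sequence, whereas a TF-IDF or BM25 lookup would return one number for "bank" regardless of its neighbors.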

littlestymaar 843 days ago

Given that GP explicitly said “if you don't have attention”, and we're in a thread about a language model whose main characteristic is not using attention, I don't understand why you insist on talking about attention …

curious_cat_163 843 days ago

I mean, if we are going to get past attention (very much on board with the idea!), then it might help to know what it is really contributing to a model.

My response was trying to clarify some confusion.

I am all for alternatives to attention. I don’t think BM25 cuts it. I don’t think anything that samples tokens based on BM25 weights (the idea in this subthread) would cut it.

binarymax 843 days ago

What confusion? I know exactly how BM25 works and how Transformers work. I stated a hypothesis and asked if anyone has tried it. You say it won’t work. That’s just your opinion. Do you have proof or evidence? This is science. Dismissal of ideas without evidence goes against scientific principles.

nelsondev 843 days ago

Not exactly related, but in the same vein - DeepImpact - deep learning to find term impacts in the context of their document.

https://arxiv.org/abs/2104.12016
