Hacker News
by Straw 843 days ago
The SSM papers and blog posts always have unnecessarily complicated explanations. At this point I almost wonder if it's to hide how simple the underlying algorithms are, or to make them seem fancy.

SSMs are doing exponentially weighted moving averages (EMA). That's it: to summarize the past, you take an EMA of a variable output at each time step. Mamba changes one key thing: instead of decaying the past by a fixed amount each step, as in a constant-time EMA, we have another output which decides how much to forget or, equivalently, how much 'time' has passed since the last observation in our EMA.
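
A minimal sketch of the two variants, in case it helps. This is a toy scalar version, not Mamba's actual code; the names `ema_fixed`, `ema_dynamic`, `dts` and `rate` are made up for illustration, and mapping elapsed 'time' to a keep factor via `exp(-rate * dt)` is an assumption about how to read the "time passed" view.

```python
import math


def ema_fixed(xs, alpha=0.9):
    """Classic EMA: the past decays by the same factor alpha at every step."""
    h = 0.0
    out = []
    for x in xs:
        h = alpha * h + (1.0 - alpha) * x
        out.append(h)
    return out


def ema_dynamic(xs, dts, rate=1.0):
    """EMA where each step carries its own elapsed 'time' dt_t >= 0.

    The per-step keep factor is alpha_t = exp(-rate * dt_t):
    dt = 0 keeps the old state untouched, a large dt forgets almost all of it.
    (Hypothetical parameterization, for illustration only.)
    """
    h = 0.0
    out = []
    for x, dt in zip(xs, dts):
        alpha = math.exp(-rate * dt)
        h = alpha * h + (1.0 - alpha) * x
        out.append(h)
    return out
```

With a constant `dt = -log(0.9)` the dynamic version reduces to `ema_fixed(xs, alpha=0.9)`; letting `dt` vary per step is the only difference.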

All of the matrix equations, continuous time, discretization, etc., will end up with a dynamic-forgetting EMA as I describe above. This also makes the benefits and limitations clear: the state size is finite, and the model has to decide at a given layer what to forget before it sees the past at that layer.
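
To spell out the reduction for the simplest case: assume a single scalar state with A = -λ < 0, a fixed input coefficient B (in Mamba, B also varies per token), and the zero-order-hold discretization given in the Mamba paper. The symbols λ and α_t are introduced here only for this illustration.

```latex
\begin{aligned}
\bar{A}_t &= \exp(\Delta_t A) = e^{-\lambda \Delta_t} =: \alpha_t, \\
\bar{B}_t &= (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - 1\bigr)\,\Delta_t B
           = \frac{B}{\lambda}\,(1 - \alpha_t), \\
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t
     = \alpha_t\, h_{t-1} + (1 - \alpha_t)\,\frac{B}{\lambda}\, x_t .
\end{aligned}
```

So under these assumptions the state update is an EMA of the rescaled input (B/λ)·x_t, with a per-step keep factor α_t set by Δ_t.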

4 comments

Are there any fundamental differences between Mamba, RetNet and RWKV, or are they all variants of this same architecture?
No, all of these use the same fundamental architecture with minor tweaks, such as the dynamic gate for Mamba or an outer-product parameterization of the values for RWKV-v5.
A dynamic gate is a pretty distinct feature from previous SSM architectures, in my opinion. In a sense, the overall fundamental architecture of Mamba is still that of the transformer, but with attention replaced by an SSM with dynamic gating. All of deep learning uses closely related ideas, but the SSM class of models took advantage of stability guarantees from integrators in control theory and created a class of RNNs that don’t have to worry about exploding gradients. Mamba is one of the ways to make these SSM models much more expressive.
It's distinct, but not very: it's an EMA without assuming uniform time. The stability of EMA has nothing to do with integrators in control theory, and neither do these models.

These models aren't really RNNs: they have only a linear gate which cannot depend on previous tokens at this layer, so they can't update their state in a way which depends very much on the current state.
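
A rough sketch of what "linear gate" buys you, with made-up names (`gate_from_input`, `run_sequential`, `run_unrolled`) and a hypothetical sigmoid gate: because the gate is computed from the input alone and never from the state, the recurrence stays linear in the state and can be unrolled into a weighted sum over past inputs. A GRU/LSTM-style gate that reads the previous state would not unroll this way.

```python
import math


def gate_from_input(x, w=1.0, b=0.0):
    """Hypothetical gate: a sigmoid of the current input only, never of the state."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def run_sequential(xs, gates):
    """Step-by-step recurrence h_t = a_t * h_{t-1} + (1 - a_t) * x_t."""
    h = 0.0
    out = []
    for x, a in zip(xs, gates):
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out


def run_unrolled(xs, gates):
    """Closed form: h_t = sum_s (prod_{r=s+1..t} a_r) * (1 - a_s) * x_s."""
    out = []
    for t in range(len(xs)):
        total = 0.0
        for s in range(t + 1):
            decay = math.prod(gates[s + 1 : t + 1])  # empty product is 1
            total += decay * (1.0 - gates[s]) * xs[s]
        out.append(total)
    return out
```

Both functions return the same values up to floating-point error; the unrolled form is quadratic here only to keep the sketch readable.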

That might explain the motivation for why the Δ variable is used and varied; but not the "Selectivity", which the article says is expressed by how the matrices B and C vary while consuming input.

Something I've noticed is that B, C and Δ depend only on the current token. See this: https://www.kolaayonrinde.com/blog/images/mamba/ssm_algorith... -- Another thing I've noticed is that the definition of "SSM" in the image I've linked to is apparently recursive. This is also in the arXiv paper. Strange.
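
For what it's worth, here is a heavily simplified sketch of that dependence, not the paper's implementation. It assumes a single scalar input channel, a diagonal negative A, and zero-order-hold discretization; `selective_scan`, `W_B`, `W_C` and `w_dt` are hypothetical names, and the real model computes B, C and Δ with learned linear projections of the full token vector.

```python
import numpy as np


def selective_scan(x, A, W_B, W_C, w_dt):
    """Toy selective SSM over one scalar channel.

    x: (T,) inputs; A: (N,) negative diagonal state matrix.
    W_B, W_C: (N,) and w_dt: scalar are hypothetical projection weights.
    """
    T = x.shape[0]
    h = np.zeros_like(A)
    y = np.empty(T)
    for t in range(T):
        # B_t, C_t and dt_t are functions of the current input x[t] only.
        B_t = W_B * x[t]
        C_t = W_C * x[t]
        dt_t = np.log1p(np.exp(w_dt * x[t]))  # softplus keeps dt positive
        A_bar = np.exp(dt_t * A)              # per-step keep factor in (0, 1)
        B_bar = (A_bar - 1.0) / A * B_t       # zero-order hold for diagonal A
        h = A_bar * h + B_bar * x[t]          # the only place the past enters
        y[t] = C_t @ h
    return y
```

The loop is the sequential form; nothing computed at step t looks at the state when choosing B, C or Δ, which is what lets the real thing be run as a scan.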

+1 though for making me go back to the article and read it more carefully! +1 also to the article.

OK, I've noticed that the pseudo-code above is vectorised, and so there's no recursion. The SSM function is actually described at the start of the paper, and an efficient hardware-aware implementation is suggested in section 3: https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf
I hadn’t heard of Mamba before reading this article, and I was wondering if anyone has tried setting the importance of a token via a TF-IDF or BM25 lookup. It requires a first pass to construct the token index, but otherwise it seems like it would address the big issue that all these architectures have: they don’t know how “important” a token is. Interestingly, this seems to be the crux of Mamba: deciding which tokens to forget! An EMA otherwise treats all tokens equally at sequence time. What if the tokens were weighted beforehand and the weights were passed in as an attention mechanism? I wonder if anyone has tried something like this.
The importance (e.g. attention) needs to be dynamic: one token will be important to some other tokens but not to others.

tf-idf and similar heuristics are what we were using before attention came along, e.g. a tf-idf-weighted bag-of-words representation of word2vec embeddings. That approach fails in so many cases.

Attention in transformers works because over time the model learns token importance based on frequency and context.

If you don’t have attention and need a fast substitute for “forgetting” non-important tokens, then BM25 is an intuitive hypothesis.

To use your metaphor, TF-IDF will result in ‘fixed’ weights.

Attention makes it so that the weights of each token can be different in each sequence of tokens. Same token gets different weights depending on who its ‘neighbors’ in the sequence end up being.
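
A toy illustration of the difference, with invented 2-d embeddings and an invented `static_weight` table (neither comes from a real model or corpus): the static weight of "bank" is one number, while its attention weight changes with the neighbor it competes against.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_weights(query, keys):
    """Scaled dot-product weights of one query over a sequence of keys."""
    scores = keys @ query / np.sqrt(query.shape[0])
    return softmax(scores)


# Hypothetical embeddings, chosen only to make the effect visible.
emb = {
    "bank": np.array([1.0, 1.0]),
    "river": np.array([2.0, 0.0]),
    "money": np.array([0.0, 2.0]),
}

# TF-IDF-style weights: one fixed number per token, whatever the context.
static_weight = {"bank": 0.4, "river": 0.7, "money": 0.6}

query = np.array([1.0, 0.0])
w_river = attention_weights(query, np.stack([emb["bank"], emb["river"]]))
w_money = attention_weights(query, np.stack([emb["bank"], emb["money"]]))
```

Here `w_river[0]` (the weight on "bank" next to "river") comes out below 0.5 and `w_money[0]` (the weight on "bank" next to "money") above 0.5, while `static_weight["bank"]` is the same in both sequences.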

This property allows the models to solve a variety of natural language problems and gets ‘used’ by the model to express context-aware dependencies.

Given that GP explicitly said “if you don't have attention”, and we're in a thread about a language model whose main characteristic is that it doesn't use attention, I don't understand why you insist on talking about attention …
I mean, if we are going to get past attention (very much on board with the idea!), then it might help to know what it is really contributing to a model.

My response was trying to clarify some confusion.

I am all for alternatives to attention. I don’t think BM25 cuts it. I don’t think anything that samples tokens based on BM25 weights (the idea in this subthread) would cut it.

Not exactly related, but in the same vein - Deep Impact - deep learning to find term impacts in the context of their document.

https://arxiv.org/abs/2104.12016

Is this analogous to digital filters, where Transformers are the FIR filters, which operate on a finite history of the input, and SSMs are the IIR filters, which take past inputs into account with an exponentially decaying importance?
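
One way to make the filter side of the analogy concrete, as a rough sketch with made-up function names. It only covers the filters themselves: attention weights are recomputed from the data at every position, so the FIR comparison is loose at best.

```python
def fir_filter(xs, taps):
    """FIR: y[t] = sum_k taps[k] * x[t-k]; only the last len(taps) inputs matter."""
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(taps):
            if t - k >= 0:
                acc += w * xs[t - k]
        out.append(acc)
    return out


def iir_ema(xs, alpha):
    """First-order IIR: y[t] = alpha * y[t-1] + (1 - alpha) * x[t]."""
    y = 0.0
    out = []
    for x in xs:
        y = alpha * y + (1.0 - alpha) * x
        out.append(y)
    return out


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
fir_response = fir_filter(impulse, taps=[0.5, 0.3, 0.2])
iir_response = iir_ema(impulse, alpha=0.5)
```

The FIR impulse response is exactly zero once the window has passed, while the EMA's response keeps halving and never reaches zero in exact arithmetic. That is the "finite window versus decaying state" distinction the question is pointing at.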