by logicchains 843 days ago
Are there any fundamental differences between Mamba, Retnet and RWKV, or are they all variants of this same architecture?

No, all of these use the same fundamental architecture with minor tweaks, such as the dynamic gate for Mamba or an outer-product parameterization of the values for RWKV-v5.
A dynamic gate is a pretty distinct feature from previous SSM architectures, in my opinion. In a sense, the overall architecture of Mamba is still that of the transformer, but with attention replaced by an SSM with dynamic gating. All of deep learning uses closely related ideas, but the SSM class of models took advantage of stability guarantees from integrators in control theory and created a class of RNNs that don’t have to worry about exploding gradients. Mamba is one of the ways to make these SSM models much more expressive.
It's distinct, but not very: it's an EMA without assuming uniform time steps. The stability of an EMA has nothing to do with integrators in control theory, and neither do these models.
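
To make the EMA reading concrete, here is a minimal sketch in plain Python. It assumes a scalar state, a decay of exp(-dt) per step, and a softplus of the current input as the step size; those choices and all function names are illustrative assumptions, not the actual Mamba, RetNet or RWKV code. With a constant dt it reduces to an ordinary EMA, and since each update is a convex combination the state stays bounded by the inputs.

```python
import math


def softplus(z):
    # Smooth positive function, used here as a hypothetical way to get dt > 0.
    return math.log1p(math.exp(z))


def ema_uniform(xs, alpha):
    # Ordinary EMA: the same smoothing factor alpha at every step.
    h, out = 0.0, []
    for x in xs:
        h = (1.0 - alpha) * h + alpha * x
        out.append(h)
    return out


def ema_nonuniform(xs, dts):
    # EMA where each step covers its own time interval dt.
    h, out = 0.0, []
    for x, dt in zip(xs, dts):
        a = math.exp(-dt)          # decay over this step; dt > 0 gives 0 < a < 1
        h = a * h + (1.0 - a) * x  # convex combination of old state and input
        out.append(h)
    return out


def ema_input_dependent(xs):
    # "Dynamic gate" in this toy setting: dt is computed from the current input.
    return ema_nonuniform(xs, [softplus(x) for x in xs])
```

Calling `ema_nonuniform(xs, [dt] * len(xs))` with `dt = -math.log(1 - alpha)` should give the same values as `ema_uniform(xs, alpha)`, up to floating-point error.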

These models aren't really RNNs: they have only a linear gate, which cannot depend on previous tokens at this layer, so they can't update their state in a way that depends much on the current state.
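
A rough sketch of what "linear gate" buys and costs, under the assumption of a scalar state and a sigmoid gate computed from the current input only (the gate function and names are hypothetical, chosen for illustration). Because the gates never look at the state, they can all be computed up front, and the recurrence unrolls into a fixed weighted sum of the inputs.

```python
import math


def gates_from_inputs(xs, w=1.0, b=0.0):
    # Hypothetical gate: a sigmoid of the current input only, never of the state.
    return [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]


def recurrent(xs, gates):
    # Step-by-step form: h_t = a_t * h_{t-1} + (1 - a_t) * x_t
    h, out = 0.0, []
    for x, a in zip(xs, gates):
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out


def unrolled(xs, gates):
    # Same values with no state carried between steps:
    # h_t = sum over s <= t of (a_{s+1} * ... * a_t) * (1 - a_s) * x_s
    out = []
    for t in range(len(xs)):
        total = 0.0
        for s in range(t + 1):
            decay = math.prod(gates[s + 1 : t + 1])  # empty product is 1.0
            total += decay * (1.0 - gates[s]) * xs[s]
        out.append(total)
    return out
```

The two functions agree up to floating-point error for any gates in (0, 1). A classic RNN such as h_t = tanh(w * h_{t-1} + x_t) has no such unrolled form, because its update depends nonlinearly on the current state.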