|
|
|
|
|
by pama
843 days ago
|
|
A dynamic gate is a pretty distinct feature from previous SSM architectures in my opinion. In a sense, the overall fundamental architecture of mamba is still that of the transformer but with attention replaced by an SSM with dynamic gating. All of deep learning uses closely related ideas, but the SSM class of models took advantage of stability guarantees from integrators in control theory and created a class of RNN that don’t have to worry about exploding gradients. Mamba is one of the ways to make these SSM models much more expressive. |
|
These models aren't really RNNs- they have only a linear gate which cannot depend on previous tokens at this layer, so they cant update their state in a way which depends on the current state very much.