Very cool ive read this line of paper originating from hippo, s4, hyena, mamba etc but can someone please explain how this isnt just an RNN/LSTM variant??
Its latent space transition is linear, instead of nonlinear, so there's a more parallelizable algorithm for advancing time in it. This makes it much more efficient to train and do inference with in GPUs.
The way it keeps all the representation power of LSTMs is by having the transition vary with the input (but still be linear).
Thanks thats helpful. One place where the parallelizability of this method falls short of the transformer is not being able to pack multiple varying length examples into the same array during training with block diagonal attention pattern. If I understand correctly thats not possible with this architecture and its an important practical concern in large scale transformer training.
The way it keeps all the representation power of LSTMs is by having the transition vary with the input (but still be linear).