Hacker News new | ask | show | jobs
by visarga 915 days ago
Mamba has dual view - you can use it both as CNN and RNN. The first is used for pre-training and for preloading the prompt because it can process all tokens at once. The second is used for token generation because it is O(1) per token. Basically two models in one, inheriting both advantages. This is possible because the Structured State Space layer is linear, so you can reshape some sums and unroll recursion into a convolution the size of the input, which can be further sped up with FFT.
2 comments

As a quick point of clarification, I don't think MAMBA has a convolutional view since it drops the time invariance and is strictly linear. The authors use parallelized prefix sum to achieve some good speed up.
and which is why the speed up is proportional to context length so starting near parity then, theoretically, see 100x at 100k tokens