by wityl
868 days ago
They were previously parallelizable (via FFT) but performed poorly on language modeling tasks. Mamba adds a dependence on the inputs that makes it competitive with transformers on language modeling, but that dependence prevents using the FFT approach. So the authors switch to a method based on a parallel prefix scan.
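To make the prefix-scan idea concrete, here is a rough sketch for a scalar recurrence of the form h_t = a_t * h_{t-1} + b_t, where a_t and b_t may depend on the input at step t. This is only an illustration under that simplifying assumption, not Mamba's actual kernel; the function names (`combine`, `sequential_scan`, `prefix_scan`) are made up for this example. The trick is that each step is an affine map, and composing affine maps is associative, so the prefixes can be combined in about log2(n) rounds.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    # right(left(h)) = a2 * (a1 * h + b1) + b2
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b, h0=0.0):
    """Reference: evaluate the recurrence one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def prefix_scan(a, b, h0=0.0):
    """Hillis-Steele style inclusive scan over the (a_t, b_t) pairs.

    Every element within one round is independent of the others, so a
    round could run in parallel; here it is a plain list comprehension.
    """
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(n)
        ]
        step *= 2
    # elems[t] now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]
```

Both functions should agree on any inputs (up to floating-point rounding); the scan version just reorders the work so that the sequential dependency chain has logarithmic rather than linear depth.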