by swyx
916 days ago
things I'd like a non-ML-researcher explanation of about Mamba:
1. what is the overall insight of state space models beyond transformers? (i know this is somewhat covered in the paper but still a bit inaccessible)
2. what was the incremental innovation/result that is making Mamba more successful/interesting than its predecessors? (S4, H3, Monarch etc)
3. what are the implications beyond subquadratic scaling of context? say if i don't really care about context length > 100k tokens. what other benefits are there - for example, is Mamba potentially more compute-efficient to train for a similar size of model/dataset?
just offering 3 prompts for knowledgeable people to drop some alpha
The overall insight of Mamba is to solve a longstanding problem with state space models. They are good at compressing the input context into a fixed-size hidden state, but that compression erases information the model needs in order to use the context as effectively as Transformers do.
Their solution to this problem is what they call a selection mechanism. The mechanism is input-dependent, so the model can adjust what it keeps in its state, and therefore its output, at each step as the input changes. They do this by making a few of the state space parameters (in the paper: B, C and the step size Δ) input-dependent instead of input-invariant. Linear layers project the input onto those parameters at each time step. The linear layers are trained along with the rest of the model, so they learn to transform the input in a way that makes the model produce useful output.
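To make the selection idea a bit more concrete, here is a toy single-channel sketch in plain Python. It is not the paper's implementation: the function name `selective_ssm`, the weights `w_B`, `w_C`, `w_delta`, `b_delta`, the diagonal `A`, and the simplified discretization are all assumptions made for illustration. The only point is that B, C and the step size are recomputed from the input at every step, while a non-selective SSM would keep them fixed.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_ssm(xs, A, w_B, w_C, w_delta, b_delta):
    """Toy selective SSM over scalar inputs (illustrative, not the real layer).

    A       : diagonal state matrix as a list of negative floats (assumption)
    w_B/w_C : hypothetical weights of the "linear layers" that produce B and C
    w_delta, b_delta : hypothetical weights that produce the step size
    """
    n = len(A)
    h = [0.0] * n  # fixed-size hidden state, independent of sequence length
    ys = []
    for x in xs:
        # Input-dependent parameters: this is the "selection" part.
        delta = softplus(w_delta * x + b_delta)
        B = [w * x for w in w_B]
        C = [w * x for w in w_C]
        # Simplified discretization (assumption): A_bar = exp(delta * A),
        # B_bar = delta * B.
        A_bar = [math.exp(delta * a) for a in A]
        B_bar = [delta * b for b in B]
        # Recurrent update, then readout.
        h = [A_bar[i] * h[i] + B_bar[i] * x for i in range(n)]
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys


if __name__ == "__main__":
    print(selective_ssm([1.0, 0.0, 2.0], A=[-1.0, -0.5],
                        w_B=[1.0, 0.5], w_C=[0.3, 0.7],
                        w_delta=1.0, b_delta=0.0))
```

A large step size makes `A_bar` small, so the old state is mostly forgotten and the current input dominates; a small step size leaves the state nearly untouched, which is roughly how the model can "skip" an input.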
But making the state space parameters input-dependent creates a computational problem: the model can no longer be computed as one big convolution, which is the trick earlier state space models relied on for fast training. They fix this with a hardware-aware algorithm that makes the most of the modern GPU memory hierarchy, avoiding moving things in and out of HBM (the GPU's large but relatively slow main memory) as much as possible.
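One piece of this can be shown without a GPU. The recurrence h_t = a_t * h_(t-1) + b_t looks strictly sequential, but each step is a small affine map, and composing affine maps is associative, so the states can be computed with a parallel scan. The sketch below checks that a scan gives the same states as the plain loop. The function names are made up for illustration, and this says nothing about the memory side (the real kernel is a fused GPU kernel, which plain Python cannot show).

```python
def combine(left, right):
    # Compose two steps of the form h -> a*h + b, applying `left` first.
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_states(steps):
    # Reference: the obvious left-to-right loop, starting from h = 0.
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def scan_states(steps):
    # Hillis-Steele inclusive scan: about log2(len) rounds, and inside a
    # round every position could be updated in parallel on real hardware.
    prefix = list(steps)
    offset = 1
    while offset < len(prefix):
        prefix = [
            combine(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
    # Starting from h = 0, the state after step t is the b-part of the prefix.
    return [b for _, b in prefix]


if __name__ == "__main__":
    steps = [(0.5, 1.0), (0.5, 1.0), (0.5, 1.0)]
    print(sequential_states(steps))
    print(scan_states(steps))
```

In the selective case each `(a, b)` pair would come from the input at that step, which is exactly why a fixed convolution kernel no longer works but a scan still does.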
Tri Dao, one of the two Mamba authors, came up with FlashAttention, which is basically a way to use GPU hardware more efficiently in a Transformer. So this is his jam 100%.
I know this doesn’t add much to understanding the paper, but hopefully it’s better than nothing.