|
|
|
|
|
by jsenn
853 days ago
|
|
This was really helpful, but only discusses linear operations, which obviously can’t be the whole story. From the paper it seems like the discretization is the only nonlinear step—in particular the selection mechanism is just a linear transformation. Is that right? How important is the particular form of the nonlinearity? EDIT: from looking at the paper, it seems like even though the core state space model/selection mechanism is linear (except for discretization?), they incorporate a nonlinearity in the full “mamba block”, which is stacked up with residual connections and layer norm just like in a transformer. They describe this as combining a linear attention and an MLP into a single step, rather than alternating attention and MLP as in a transformer. |
|