by jsenn 853 days ago
This was really helpful, but it only discusses linear operations, which obviously can't be the whole story. From the paper it seems like the discretization is the only nonlinear step; in particular, the selection mechanism is just a linear transformation. Is that right? How important is the particular form of the nonlinearity?

EDIT: From looking at the paper, it seems that even though the core state space model/selection mechanism is linear (except for discretization?), they incorporate a nonlinearity in the full "Mamba block", which is stacked with residual connections and layer norm just like in a transformer. They describe this as combining linear attention and an MLP into a single step, rather than alternating attention and MLP layers as in a transformer.
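
To make the "linear except for discretization" point concrete, here is a minimal NumPy sketch. It assumes a diagonal state matrix, a scalar input channel, a fixed (not input-dependent) step size, and zero-order-hold discretization; the names `discretize` and `ssm_scan` are made up for illustration and are not from the paper's code. The exponential in the discretization acts on the parameters rather than on the input, so with the parameters held fixed the recurrence satisfies superposition.

```python
import numpy as np

def discretize(A, B, delta):
    """Zero-order hold for a diagonal A (1-D array of nonzero reals)."""
    A_bar = np.exp(delta * A)          # nonlinear in (delta, A), not in the input
    B_bar = (A_bar - 1.0) / A * B      # exact ZOH when A is diagonal
    return A_bar, B_bar

def ssm_scan(x, A, B, C, delta):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t and y_t = C . h_t over a 1-D sequence."""
    A_bar, B_bar = discretize(A, B, delta)
    h = np.zeros_like(A)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
N, L = 4, 16                            # state size and sequence length (arbitrary)
A = -np.arange(1.0, N + 1)              # stable diagonal state matrix (assumption)
B = rng.normal(size=N)
C = rng.normal(size=N)
x1, x2 = rng.normal(size=L), rng.normal(size=L)

# Superposition check: the scan of a linear combination equals
# the same linear combination of the individual scans.
lhs = ssm_scan(2.0 * x1 + 3.0 * x2, A, B, C, delta=0.1)
rhs = 2.0 * ssm_scan(x1, A, B, C, delta=0.1) + 3.0 * ssm_scan(x2, A, B, C, delta=0.1)
superposition_holds = np.allclose(lhs, rhs)
```

This only covers the fixed-parameter case; it says nothing about what happens once the step size and projections are computed from the input.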

1 comment

Yes, you're spot on: the nonlinearities come from the full Mamba blocks, which I left out of this post for simplicity and to focus on the bigger ideas the paper introduced. You can see it marked by the "X" on the right-most part of Figure 3 in the Mamba paper: https://arxiv.org/abs/2312.00752
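
For a rough picture of where the nonlinearity enters, here is a simplified sketch of a gated block, assuming the "X" denotes an elementwise multiplication between the sequence-mixing branch and an activated gate branch. It is not the paper's implementation: `linear_ssm` is a placeholder fixed linear recurrence, the convolution and selection mechanism are omitted, and all names and shapes are illustrative.

```python
import numpy as np

def silu(z):
    """SiLU / swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def layer_norm(x, eps=1e-5):
    """Normalize each position over the feature axis (no learned scale or shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def linear_ssm(u, a=0.9):
    """Placeholder linear mixer along time: h_t = a * h_{t-1} + (1 - a) * u_t per channel."""
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for t in range(u.shape[0]):
        h = a * h + (1.0 - a) * u[t]
        out[t] = h
    return out

def gated_block(x, W_in, W_gate, W_out):
    """x has shape (L, D). Returns x plus the projected, gated output of the linear mixer."""
    z = layer_norm(x)
    main = linear_ssm(z @ W_in)        # linear in z
    gate = silu(z @ W_gate)            # nonlinear branch
    return x + (main * gate) @ W_out   # elementwise gate, then residual connection

rng = np.random.default_rng(0)
L, D, E = 8, 4, 8                       # sequence length, model width, inner width
x = rng.normal(size=(L, D))
W_in = rng.normal(size=(D, E))
W_gate = rng.normal(size=(D, E))
W_out = rng.normal(size=(E, D))
y = gated_block(x, W_in, W_gate, W_out)
```

Stacking several such blocks gives the transformer-like layout described in the parent comment, with the gate playing a role loosely similar to the MLP's activation.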