I really disagree with pigeonholing it as an LLM architecture! It is much more general than that as I mentioned in another comment in this post [1] (and of course as mentioned in the original paper which you linked).
It totally mentions what it does. It takes the sentence "I have a dream that" and extends it to: "I have a dream that I will be able to see the sunrise in the morning."
It’s much more than just an LLM. The Mamba architecture is often used as the backbone of an LLM, but you can use it more generally as a linear-time (as opposed to quadratic-time) sequence modeling architecture (as per the original paper’s title, which is cited in the linked repo). It is much closer to a convolutional network or an RNN (it has bits of both) than to a transformer architecture. It is based on the notion of state spaces (with a twist).
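To make the "linear-time" and "state spaces with a twist" points concrete, here is a toy sketch of a selective state-space recurrence in NumPy. It is not the real Mamba implementation, which uses a hardware-aware parallel scan plus a convolution and gating around this core. The function name, the projection matrices `W_B`, `W_C`, `W_dt`, and all shapes are assumptions made for illustration. The "twist" shown is that the step size and the B/C matrices depend on the input at each timestep.

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_dt):
    """Toy selective state-space scan over one sequence (illustrative only).

    x:    (T, D) input sequence
    A:    (D, N) negative diagonal state matrix, one row per channel
    W_B:  (D, N) hypothetical projection producing the input-dependent B_t
    W_C:  (D, N) hypothetical projection producing the input-dependent C_t
    W_dt: (D, D) hypothetical projection producing the per-channel step size
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))           # hidden state carried across timesteps
    y = np.empty((T, D))
    for t in range(T):             # a single pass over the sequence: O(T)
        dt = np.log1p(np.exp(x[t] @ W_dt))   # softplus keeps step size > 0, shape (D,)
        B_t = x[t] @ W_B                     # shape (N,)
        C_t = x[t] @ W_C                     # shape (N,)
        A_bar = np.exp(dt[:, None] * A)      # discretized decay, shape (D, N)
        # State update: decay the old state, then add the newly gated input.
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * x[t][:, None]
        y[t] = h @ C_t                       # read out, shape (D,)
    return y
```

Each timestep only touches the fixed-size state `h`, so the cost grows linearly with sequence length, whereas self-attention compares every pair of positions.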
I use Mamba for instance to build surrogate models of physics-based building energy models which can generate 15-min interval data for heating, cooling, electricity, and hot water usage of any building in the US from building characteristics, weather timeseries, and occupancy time series.
The Mamba application is my current research project, so I haven’t published anything yet. But the basic idea is to create a latent representation of the static features, repeat the latent vector to form a time series, concatenate it with the weather/occupancy time series, run the result through Mamba layers, and Bob’s your uncle. Shoot me an email (in my bio) if you would like to chat more!
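A rough sketch of just the input-construction step described above (encode, repeat, concatenate), in NumPy. This is a guess at the general shape of the idea, not the actual research code: the single linear layer with `tanh` stands in for whatever encoder is really used, and `build_sequence_input`, `W_enc`, `b_enc`, and every dimension are hypothetical.

```python
import numpy as np

def build_sequence_input(static_feats, timeseries, W_enc, b_enc):
    """Combine per-building static features with per-timestep drivers.

    static_feats: (F,)   building characteristics
    timeseries:   (T, K) weather/occupancy channels
    W_enc, b_enc: (F, L) and (L,) hypothetical encoder parameters
    returns:      (T, L + K) array to feed into the sequence layers
    """
    # 1. Latent representation of the static features (stand-in encoder).
    latent = np.tanh(static_feats @ W_enc + b_enc)        # (L,)
    # 2. Repeat the latent vector at every timestep.
    T = timeseries.shape[0]
    latent_seq = np.tile(latent, (T, 1))                  # (T, L)
    # 3. Concatenate with the time-varying inputs along the feature axis.
    return np.concatenate([latent_seq, timeseries], axis=1)

# Example: one day at 15-min resolution is 96 timesteps.
rng = np.random.default_rng(0)
static = rng.normal(size=12)            # 12 made-up building features
drivers = rng.normal(size=(96, 5))      # 5 made-up weather/occupancy channels
W_enc = rng.normal(size=(12, 8))
b_enc = np.zeros(8)
seq_in = build_sequence_input(static, drivers, W_enc, b_enc)
print(seq_in.shape)
```

The resulting `(T, L + K)` array is what would then go through the stack of sequence layers, which this sketch leaves out.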
I can also share my master’s thesis which is similar but using CNN layers rather than Mamba and only for monthly predictions rather than 15-min interval data. There are some other architectural differences but the basics are the same. That work is also globally robust.
As you can imagine, the current work I am doing at a much higher resolution is a big step up, and Mamba so far is working out great.
Proponents of it usually highlight its inference performance, in particular linear scaling with the number of input tokens.