|
SSM papers and blog posts always have unnecessarily complicated explanations. At this point I almost wonder if it's to hide how simple the underlying algorithms are, or to make them seem fancy. SSMs are doing exponentially weighted moving averages (EMAs). That's it: to summarize the past, you take an EMA of a variable that the model outputs at each time step. Mamba changes one key thing: instead of decaying the past by a fixed amount each step, as in a constant-time-step EMA, we have another output which decides how much to forget, or equivalently, how much 'time' has passed since the last observation in our EMA. All of the matrix equations, continuous time, discretization, etc., end up as the dynamic-forgetting EMA I describe above. This also makes the benefits and limitations clear: the state size is finite, and a given layer has to decide what to forget before it sees the past at that layer. |
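
To make the EMA picture concrete, here is a minimal scalar sketch in plain Python. The function names `fixed_ema` and `dynamic_ema` and the `dts` argument are made up for illustration, and this is not Mamba's actual parameterization: a real layer keeps a vector of state per channel and computes the step size from the current input. The decay `exp(-dt)` and the input weight `1 - exp(-dt)` are assumptions chosen so that a constant `dt` reduces exactly to an ordinary EMA.

```python
import math


def fixed_ema(xs, decay=0.9):
    """Ordinary EMA: the past is decayed by the same factor at every step."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out


def dynamic_ema(xs, dts):
    """EMA where each step chooses how much to forget.

    dts[t] >= 0 plays the role of 'time elapsed' before observation t.
    (Assumption for this sketch: decay_t = exp(-dt_t).)
    dt = 0      -> decay 1: keep the state, ignore the new input.
    dt is large -> decay ~0: forget the past, state becomes the new input.
    """
    h = 0.0
    out = []
    for x, dt in zip(xs, dts):
        a = math.exp(-dt)            # per-step forgetting factor
        h = a * h + (1.0 - a) * x    # same update as fixed_ema, with a varying
        out.append(h)
    return out


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    # With a constant dt the dynamic version is just a fixed EMA.
    print(fixed_ema(xs, decay=0.5))
    print(dynamic_ema(xs, [math.log(2.0)] * 3))
    # A large dt overwrites the state; dt = 0 then freezes it.
    print(dynamic_ema([5.0, 7.0], [100.0, 0.0]))
```

In a model, `dts` would not be passed in by hand; it would be another output computed from the current token, which is where the "has to decide what to forget before it sees the past at that layer" limitation comes from. The state `h` has a fixed size no matter how long the sequence is.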