Hacker News new | ask | show | jobs
by opprobium 916 days ago
Re 3) Even if you don't care about long context length, Mamba is much cheaper per token of auto-regressive output. Each token has to only compute the next step of a linear RNN, the transformer has to attend back over all previous outputs, which rapidly grows in cost and memory.