| HN Mirror

	by opprobium 916 days ago
	Re 3) Even if you don't care about long context length, Mamba is much cheaper per token of autoregressive output. Each new token only requires computing the next step of a linear RNN over a fixed-size state, whereas the transformer has to attend back over all previous tokens, so its per-token compute and memory keep growing as the sequence gets longer.
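To make the scaling difference concrete, here is a rough toy sketch in plain Python. It is not Mamba's actual selective scan or a real transformer layer: the width `D`, the fixed coefficients `a` and `b`, and the toy "embedding" are all assumptions made up for illustration. The only point is that the recurrent decoder carries a constant-size state, while the attention decoder has to touch a cache that grows by one entry per generated token.

```python
import math

D = 4  # hypothetical model width, chosen only for illustration


def rnn_decode(inputs, a, b):
    """Decode with a diagonal linear recurrence h_t = a * h_{t-1} + b * x_t.

    Returns the final state and the state size seen at each step.
    """
    state = [0.0] * D
    state_sizes = []
    for x in inputs:
        # One step: O(D) work, independent of how many tokens came before.
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        state_sizes.append(len(state))
    return state, state_sizes


def attn_decode(inputs):
    """Decode with single-head softmax attention over a growing cache.

    Returns the last output and the number of cached entries read per step.
    """
    cache = []            # toy KV cache: keys and values share one vector
    entries_read = []
    out = None
    for x in inputs:
        vec = [x] * D     # toy embedding, used as query, key and value
        cache.append(vec)
        # Score the new query against every cached key: O(t * D) work.
        scores = [sum(q * k for q, k in zip(vec, key)) for key in cache]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [
            sum(w * v[i] for w, v in zip(weights, cache)) / total
            for i in range(D)
        ]
        entries_read.append(len(cache))
    return out, entries_read


inputs = [0.1 * i for i in range(1, 9)]
_, state_sizes = rnn_decode(inputs, a=[0.9] * D, b=[1.0] * D)
_, entries_read = attn_decode(inputs)
print("RNN state size per step:     ", state_sizes)
print("Attention cache reads per step:", entries_read)
```

In this sketch the recurrent state stays at `D` numbers for every step, while the attention step reads 1, 2, 3, ... cache entries, so the total attention work over a generation of length T grows roughly with T squared.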