by immibis 484 days ago
Transformers are completely unlike RNNs.
1 comments

There are some interesting connections between them. If you remove the softmax from the attention formula, you end up with linear attention, which (with causal masking) can be computed as a recurrence over a fixed-size state, like an RNN.

I haven't read it, but the Mamba 2 paper claims to establish a stronger connection.

*If you remove the softmax from the attention formula, you end up with linear attention*

Sorry, what?

Here is a paper explaining it: https://arxiv.org/abs/2006.16236
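
For anyone who wants to see the idea concretely, here is a minimal NumPy sketch of causal linear attention computed two ways: once in the usual attention-style matrix form, and once as a token-by-token recurrence. The function names are made up for illustration, and the feature map elu(x) + 1 is the one I recall the linked paper using; treat the rest as a toy under those assumptions, not the paper's implementation.

```python
import numpy as np

def phi(x):
    # Feature map elu(x) + 1 (assumed here, following the linked paper);
    # it keeps every feature strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(Q, K, V):
    # Attention-style form: softmax(QK^T) is replaced by phi(Q) phi(K)^T,
    # then causally masked and normalized per row.
    q, k = phi(Q), phi(K)
    scores = np.tril(q @ k.T)  # position i only sees positions j <= i
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention_recurrent(Q, K, V):
    # RNN-style form: a fixed-size state (S, z) is updated once per token.
    q, k = phi(Q), phi(K)
    S = np.zeros((K.shape[1], V.shape[1]))  # running sum of outer(phi(k_j), v_j)
    z = np.zeros(K.shape[1])                # running sum of phi(k_j)
    out = np.empty_like(V)
    for i in range(Q.shape[0]):
        S += np.outer(k[i], V[i])
        z += k[i]
        out[i] = (q[i] @ S) / (q[i] @ z)
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
same = np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

The two functions agree because, without the softmax, the sum over past keys and values can be factored out of the query product and carried forward as state. With the softmax in place that factoring is not possible, which is why standard attention has to keep all past keys and values around.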