by Kubuxu
135 days ago
A paper on the same topic: "On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective" by Gabriel Mongaras and Eric C. Larson, https://arxiv.org/abs/2507.23632

Video presentation, if someone prefers it: https://www.youtube.com/watch?v=PN3nYBowSvM

Linear attention is a first-order Taylor approximation of softmax attention, and model performance gets better as you increase the order of the Taylor approximation.

I'm thinking about adapting an existing model to Taylor-approximated attention. I think it should be possible with some model surgery and rehabilitation training.
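To make the approximation idea concrete, here is a rough NumPy sketch of how I read it. It is not code from the paper. It replaces exp(s) in the softmax numerator with its Taylor series truncated at a given degree, then normalizes each row as usual. The function names, the 1/sqrt(d) scaling, and the small random inputs are my own assumptions for illustration.

```python
import math
import numpy as np


def softmax_attention(q, k, v):
    """Reference single-head attention (no masking), weights = exp(scores)."""
    scores = q @ k.T / math.sqrt(q.shape[-1])
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def taylor_attention(q, k, v, degree):
    """Same as above, but exp(s) is replaced by sum_{n<=degree} s**n / n!.

    Assumption: scores are small enough that the truncated series stays
    positive; for odd degrees and large negative scores it can go negative.
    """
    scores = q @ k.T / math.sqrt(q.shape[-1])
    weights = sum(scores**n / math.factorial(n) for n in range(degree + 1))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


# Hypothetical toy inputs: 8 queries/keys of width 16, values of width 4.
rng = np.random.default_rng(0)
q = rng.normal(scale=0.5, size=(8, 16))
k = rng.normal(scale=0.5, size=(8, 16))
v = rng.normal(size=(8, 4))

exact = softmax_attention(q, k, v)
for degree in (1, 2, 4):
    err = np.abs(taylor_attention(q, k, v, degree) - exact).max()
    print(f"degree {degree}: max abs error {err:.2e}")
```

With degree 1 the weight is just 1 + q·k/sqrt(d), which is linear in the keys. That is what lets the sum over keys be precomputed in the linear-attention form. This sketch materializes the full score matrix, so it only shows the approximation quality, not the efficient recurrent formulation.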