Hacker News
by ttul 1062 days ago
The key differences between multi-head attention in Transformers and the proposed multi-scale retention (MSR) in RetNets are:

- Retention replaces the softmax in attention with an exponential decay along the sequence dimension. This allows retention to be written in a recurrent form, so inference costs O(1) per generated token.

- Retention heads use different decay rates (gamma values) for multi-scale modeling. Attention heads all apply the same softmax, with no per-head decay.

- Retention outputs are normalized per head with GroupNorm. Standard attention concatenates the head outputs without per-head normalization and relies on the block-level LayerNorm.

- Retention can be computed in parallel, recurrent, or chunkwise recurrent modes. Attention is only parallel.

- The recurrent form enables RetNets to summarize long previous context into a fixed-size state during inference. Attention has to attend over the full context (or its cached keys and values) at each step, so the per-token cost grows with sequence length.

In summary, retention adapts attention to enable recurrent modeling and multi-scale decay. This provides efficiency benefits while keeping performance competitive with Transformers.
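To make the parallel vs. recurrent point concrete, here is a rough sketch of single-head retention in plain Python. It is my own simplification, not the paper's code: it leaves out the scaling, the position rotation, and the GroupNorm, and the function names and toy inputs are made up for illustration. It only shows that the decayed, causally masked sum and the fixed-size-state recurrence give the same outputs.

```python
# Sketch of single-head retention (assumption: no scaling, rotation, or GroupNorm).
# Pure Python so it runs without dependencies; all names are illustrative.

def retention_parallel(Q, K, V, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma**(n-m) * (Q[n] . K[m]) * V[m]."""
    n_tokens, d_v = len(Q), len(V[0])
    out = []
    for n in range(n_tokens):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            score = sum(q * k for q, k in zip(Q[n], K[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, then out[n] = Q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of length
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]  # decay old state, add new outer product
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy inputs: 3 tokens, key dimension 2, value dimension 2.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)
max_diff = max(abs(a - b) for pr, rr in zip(par, rec) for a, b in zip(pr, rr))
print(max_diff < 1e-9)
```

The recurrent version only ever keeps the d_k x d_v state S, which is where the constant per-token cost comes from. The parallel version touches every earlier position, which is what you would use for training.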

1 comment

With transformer models, it’s common to put instructions and system messages at the beginning of the input. But with this decay, the beginning of the input would always get the weakest weighting, right? Maybe the instructions should be moved to the end. But then again, if it’s recurrent, you might want to prime it with a description of the task.
Hypothetically, fine-tuned transformer networks would learn to attend more to the beginning and end of the sequence, since that is where the instructions typically are. And lo and behold, a recent paper demonstrated this to be true.
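On the decay question, a back-of-the-envelope sketch of how much weight a token far back in the input keeps under each head's decay may help. The per-head schedule here, gamma_i = 1 - 2^(-5 - i), is what I recall from the RetNet paper, so treat it as an assumption. The helper names are made up, and this looks only at the raw decay factor, ignoring the query-key scores and normalization.

```python
def head_decays(num_heads):
    # Assumed schedule: gamma_i = 1 - 2**(-5 - i) for head index i.
    return [1.0 - 2.0 ** (-5 - i) for i in range(num_heads)]


def decay_weight(gamma, distance):
    # Raw decay factor applied to a token `distance` positions back.
    return gamma ** distance


gammas = head_decays(4)
weights = [decay_weight(g, 1000) for g in gammas]
for g, w in zip(gammas, weights):
    print(f"gamma={g:.6f}  weight at distance 1000: {w:.3e}")
```

Heads with gamma closer to 1 forget much more slowly, so how badly an instruction at the start fades depends on which heads carry it and on how long the input is.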