by ttul
1062 days ago
The key differences between multi-head attention in Transformers and the proposed multi-scale retention (MSR) in RetNets are:

- Retention replaces the softmax in attention with an exponential decay along the sequence dimension. This lets retention be written in a recurrent form with O(1) cost per generated token at inference.
- Retention heads use different decay rates (gamma values) for multi-scale modeling. Attention heads all apply the same softmax.
- Retention outputs are normalized per head with GroupNorm before concatenation. Transformers apply LayerNorm to the full hidden vector rather than per head.
- Retention can be computed in parallel, recurrent, or chunkwise recurrent modes. Attention is only parallel.
- The recurrent form lets RetNets summarize long previous context into a fixed-size state during inference. Attention has to attend over the full context (or a KV cache that grows with it) at each step.

In summary, retention adapts attention to support recurrent computation and multi-scale decay. This gives efficiency benefits while keeping performance competitive with Transformers.
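To make the parallel vs. recurrent point concrete, here is a minimal NumPy sketch of a single retention head. It is a simplification, not the paper's implementation: it leaves out the position rotation, scaling, GroupNorm, and gating, and the function names, shapes, and gamma = 0.9 are made up for illustration. It only shows that the decay-masked parallel form and the fixed-size-state recurrent form give the same output.

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Parallel form: (Q K^T * D) V, where D[n, m] = gamma^(n-m) for n >= m, else 0."""
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]  # n - m for every (query, key) pair
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return (q @ k.T * decay) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    d_k, d_v = k.shape[1], v.shape[1]
    state = np.zeros((d_k, d_v))  # fixed size, independent of sequence length
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = gamma * state + np.outer(k_t, v_t)
        outputs.append(q_t @ state)
    return np.stack(outputs)

# Random toy inputs: sequence length 6, head dimension 4 (arbitrary choices).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))

out_par = retention_parallel(q, k, v, gamma=0.9)
out_rec = retention_recurrent(q, k, v, gamma=0.9)
print(np.allclose(out_par, out_rec))
```

The recurrent version touches only a d_k x d_v state per step no matter how long the sequence is, which is where the O(1) per-token cost comes from. A multi-scale version would run several such heads, each with its own gamma.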