|
|
|
|
|
by slickytail
395 days ago
|
|
The relative unimportance of the exact SDPA attention in use in modern transformers is already known: https://arxiv.org/abs/2111.11418 The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc. |
|