|
|
|
|
|
by atgctg
924 days ago
|
|
A lot of transformer explanations fail to mention what makes self attention so powerful. Unlike traditional neural networks with fixed weights, self-attention layers adaptively weight connections between inputs based on context. This allows transformers to accomplish in a single layer what would take traditional networks multiple layers. |
|
1. There are the model weights, aka the parameters. These are what get adjusted during training to do the learning part. They always exist.
2. There are attention weights. These are part of the transformer architecture and they “weight” the context of the input. They are ephemeral. Used and discarded. Don’t always exist.
They are both typically 32-bit floats in case you’re curious but still different concepts.