by quantadev
595 days ago
When discussing "attention heads" in the context of the Transformer paper, there's no need to put the word "self" in front, as in "self-attention". That's the sense in which I used the word "attention" above. Something similar to self-attention predated the paper, but not self-attention itself. You're right that getting rid of recurrence was another innovation, but removing it was probably more of a hack to make things parallelizable than something architecturally justifiable from first principles (as self-attention is). There's definite power in recurrence, which makes it desirable, but it's just too costly in compute cycles to run in LLMs.

But that's the entire point. Transformer-based LLMs are “more intelligent” precisely because this parallelization lets you make them bigger and train them on bigger datasets.
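
To make the parallelization point concrete, here is a toy sketch, not anything taken from the paper. It assumes scalar inputs, no learned projections, and no causal mask, and the names `rnn_states`, `self_attention_row`, and `self_attention` are made up for illustration. The recurrent loop needs the previous state before it can take the next step, while each attention output depends only on the inputs, so the rows could be computed in any order or all at once.

```python
import math

def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in xs:
        # Step t cannot start until step t-1 has produced h.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def self_attention_row(i, xs):
    """Toy attention output for position i (query = key = value = x)."""
    scores = [xs[i] * x_j for x_j in xs]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * x_j for w, x_j in zip(weights, xs)) / total

def self_attention(xs):
    # Each row reads only the inputs, never another row's output,
    # so this comprehension could be split across workers freely.
    return [self_attention_row(i, xs) for i in range(len(xs))]

if __name__ == "__main__":
    xs = [0.1, -0.4, 0.7, 0.2]
    print(rnn_states(xs))
    print(self_attention(xs))
```

Real implementations use learned query/key/value matrices and batch the whole thing into a few matrix multiplications, which is where the hardware speedup actually comes from; the sketch only shows the dependency structure.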