by littlestymaar
596 days ago
Attention heads existed before Transformers; they were used in recurrent neural networks (RNNs) to improve their performance. The paper is called “Attention is all you need” because transformers keep the attention heads while discarding the RNN part entirely. Getting rid of the RNN vastly improved training scalability and allowed big players to start training enormous models on even more enormous training sets, in ways that weren't possible with an RNN AFAIK.

You're right that getting rid of recurrence was another innovation, but removing it was probably more of a hack to make things parallelizable than something architecturally justifiable from first principles (the way self-attention is). There's definite "power" in recurrence, which makes it desirable, but it's just too costly in compute to run in LLMs, since each step has to wait for the previous one.
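
To make the parallelization point concrete, here's a rough toy sketch in plain Python (not from the paper). Everything in it is an assumption for illustration: scalar inputs, made-up weights `w_h` and `w_x`, and an attention function with identity projections instead of learned query/key/value matrices. The only thing it's meant to show is the dependency structure: the RNN loop needs the previous hidden state, while each attention output can be computed on its own.

```python
import math

def rnn_forward(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN (hypothetical weights): h_t depends on h_{t-1}."""
    h = 0.0
    states = []
    for x in xs:  # inherently sequential: step t needs the result of step t-1
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def self_attention(xs):
    """Toy scalar self-attention with identity projections (no learned weights)."""
    out = []
    for q in xs:  # iterations are independent, so they could all run in parallel
        scores = [q * k for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, xs)) / total)
    return out
```

In `rnn_forward` you can't compute the last state without walking through every earlier one, whereas in `self_attention` the loop body only reads the inputs, which is what lets real implementations turn it into a few big matrix multiplications.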