by quantadev
596 days ago
The biggest thing I took away from the Transformer paper (Attention Is All You Need) is how the attention heads are wired up in such a way as to allow words to be "understood" in the proper context. For example, "see Spot run" and "run a computer program" give the word "run" dramatically different but specific contexts. It was also my understanding that without those attention heads, even scaling up to the parameter counts we have today would not have produced the level of emergent intelligence that shocked the world with GPT-3.5. We needed both very large models and words placed in context in semantic space.
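
To make the "run" example concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. Everything in it is an assumption made for illustration: the toy vocabulary, the embedding size `d`, and the random (untrained) matrices `Wq`, `Wk`, `Wv`. It also leaves out positional encodings and multiple heads. It only shows that the same input vector for "run" comes out as a different vector depending on the words around it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X has shape (tokens, d). Each token is projected to a query, key, and value.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token scores every other token; scale by sqrt of the key size.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a context-weighted mix of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8  # hypothetical embedding size
vocab = ["see", "spot", "run", "a", "computer", "program"]
emb = {w: rng.normal(size=d) for w in vocab}  # random toy embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # untrained weights

def encode(sentence):
    X = np.stack([emb[w] for w in sentence])
    return self_attention(X, Wq, Wk, Wv)

s1 = ["see", "spot", "run"]
s2 = ["run", "a", "computer", "program"]
out1, w1 = encode(s1)
out2, w2 = encode(s2)

# Same input embedding for "run", but a different output in each sentence.
run1 = out1[s1.index("run")]
run2 = out2[s2.index("run")]
print("same vector in both contexts?", np.allclose(run1, run2))
```

In a trained model the weights are learned, so the mixing reflects meaning rather than random numbers, but the mechanism is the same.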
Getting rid of the RNN vastly improved training scalability and allowed the big players to start training enormous models on even more enormous training sets, in ways that weren't possible with an RNN, AFAIK.
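
A rough sketch of why that is, as I understand it: an RNN has to walk the sequence one step at a time because each hidden state depends on the previous one, while attention handles all positions with a few matrix multiplies that parallelize well. The sizes `T` and `d`, the random weights, and the simplification of using the inputs directly as queries, keys, and values are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4  # hypothetical sequence length and feature size
X = rng.normal(size=(T, d))
Wx, Wh = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def rnn_forward(X, Wx, Wh):
    # Sequential: step t cannot start until step t-1 has produced h.
    h = np.zeros(X.shape[1])
    states = []
    for x_t in X:
        h = np.tanh(x_t @ Wx + h @ Wh)
        states.append(h)
    return np.stack(states)

def attention_forward(X):
    # No loop over time: all pairwise scores come from one matrix product.
    # Simplification: X is used directly as queries, keys, and values.
    scores = X @ X.T / np.sqrt(X.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ X

rnn_out = rnn_forward(X, Wx, Wh)
attn_out = attention_forward(X)
print(rnn_out.shape, attn_out.shape)
```

The loop in `rnn_forward` is the part that cannot be spread across the sequence dimension during training; `attention_forward` has no such dependency between positions.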