by albertzeyer
1142 days ago
Why do you think the original paper title was a poor choice? It very much highlights the main idea, the main aspect studied in the paper. The paper title is "Attention Is All You Need", for those who don't know. Attention at that point in time was already very well known and part of the standard translation model. But all those attention-based encoder-decoder models were using LSTMs, or maybe CNNs. Self-attention was also already known at that point, although still rarely used. So the novelty was studying whether a model still works when you remove almost everything except attention. Such a study was, on the one hand, just interesting in itself. But such a model also had some advantages, like faster training. In the next few years, the faster training was actually the main advantage over LSTM-based models. For a long time, it was never really clear whether a Transformer is really better than an LSTM-based model when trained for the same number of epochs. In most comparisons, Transformers were simply trained for many more epochs.