Hacker News
by albertzeyer 1142 days ago
Why do you think the original paper title was a poor choice? It very much highlights the main idea, the main aspect which is studied in this paper.

The paper title is "Attention is all you need", for those who don't know.

Attention was already well known at that point and part of the standard translation model. But all those attention-based encoder-decoder models were using LSTMs, or maybe CNNs. Self-attention was also already known, although still rarely used. So the novelty was studying whether a model still works when you remove almost everything except attention.

Such a study was, on the one hand, interesting in itself. But such a model also had some advantages, like faster training. In the next few years, the faster training was actually the main advantage over LSTM-based models. For a long time, it was never really clear whether a Transformer is actually better than an LSTM-based model when trained for the same number of epochs. In most comparisons, Transformers were simply trained for many more epochs.

1 comments

I'm well aware of the research that led to it. I was already working in the field back then and I remember that the community was far from realizing how monumental this paper would end up being. Otherwise the authors probably would have considered a more informative or at least less ambiguous title. It also didn't help that the architecture they described (encoder-decoder) was actually more complicated than what we have now in GPT and the like. And the really important thing was not that it could train for more epochs than recurrent architectures (although that certainly helped the huge models that came later), but that it could drastically extend context length for sequence tasks. They went from a theoretically infinite (but in practice very limited) context length to a fundamentally limited but practically obtainable one.