|
While in Google Research, I worked with two of the authors of the "Attention is All you Need" paper, including the gentleman who chose that title. As others have pointed out, self-attention was already a known concept in the research community. They don't claim to have invented that. Rather, the authors began by looking at how to improve the power of feed-forward neural networks using a combination of techniques, obtained some exciting results, and then, in the course of ablation studies, discovered that attention was really all you needed! The title is a play on the Beatles song, "All You Need Is Love". In terms of expository style, the paper that was most helpful for me was [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238) by Phuong and Hutter. Written for clarity and with an emphasis on precision, the motivation section (Section 2) of the paper does a great job of explaining deficiencies in the original paper and subsequent ones. |
"Source code vs pseudocode. Providing open source code is very useful, but not a proper substitute for formal algorithms. There is a massive difference between a (partial) Python dump and well-crafted pseudocode. A lot of abstraction and clean-up is necessary: remove boiler plate code, use mostly single-letter variable names, replace code by math expressions wherever possible, e.g. replace loops by sums, remove (some) optimizations, etc. A well-crafted pseudocode is often less than a page and still essentially complete, compared to often thousands of lines of real source code.'