Noam Shazeer was one of the lead authors of the seminal paper "Attention Is All You Need", which introduced the transformer architecture. (From Wikipedia)
The architecture was Shazeer's, but the rough idea came from Jakob Uszkoreit who initiated the project.
Uszkoreit wanted to build a more efficient/scalable language/seq2seq model that could take advantage of GPU parallelism (replacing RNNs which were the main approach to sequence modelling at that time).
Uszkoreit's insight was that although language appears sequential, it is in fact part parallel, part hierarchical, as can be seen in linguists' sentence parse trees, where at each level there is parallelism/independence between the branches of the tree, with them getting combined at the next level up. This is what gave rise to the idea of a model that consisted of a stack of parallel processing layers (transformer layers). I believe that attention was also part of the plan from day one, as this had already been proven to be valuable (Bahdanau) with RNN seq2seq modelling.
So, this is what Uszkoreit wanted to build, but by his own account he failed to come up with an implementation that matched or outperformed the prevailing RNN approach that he wanted to replace. At this point, Uszkoreit mentioned the idea to Shazeer, who got on board and eventually arrived at a performant architecture, which was then pared back by an ablation process, resulting in the initial encoder-decoder Transformer architecture. Shazeer is also known for the sparsely-gated mixture-of-experts layer (published in early 2017, shortly before the Transformer paper), and for other optimizations, including after he left to found character.ai.
Curious about others' contributions to the paper, such as Vaswani, Parmar, Jones and Gomez. What sucks about co-authorship in research papers is that you don't get a clean breakdown of who contributed what, and the distribution (in more cases than not) looks very much like a Pareto distribution.
I'm talking from plenty of group project experience here.
Can you expound on the ablation process? Is that referring to a stripping down of the data or weights or something? Or a stripping down of the transformer architecture structurally? Just curious
You train the model, then do a baseline evaluation. Then you evaluate many variants in which you have removed or nulled out different layers or chunks of the model. By comparing the performance of those mutated models to the baseline you can learn a lot about the model: which parts don't have much value and can be removed, where particular "functions" or "facts" are located, etc. Google it.
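A minimal sketch of that loop in Python, assuming a toy "model" that is just a sum of named components. Every name here (`TOY_COMPONENTS`, `TOY_DATA`, `ablation_study`) is hypothetical and made up for illustration; this is not how the Transformer authors ran their ablations. Note also that in architecture research the ablated variant is often retrained from scratch rather than just re-evaluated, which this sketch does not do.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Component = Callable[[float], float]

# Hypothetical toy model: the prediction is the sum of all enabled components.
TOY_COMPONENTS: Dict[str, Component] = {
    "linear": lambda x: 2.0 * x,   # carries most of the signal
    "bias": lambda x: 1.0,         # small but real contribution
    "unused": lambda x: 0.0,       # contributes nothing
}

# Toy evaluation data generated from y = 2x + 1.
TOY_DATA: List[Tuple[float, float]] = [(float(i), 2.0 * i + 1.0) for i in range(5)]


def predict(components: Dict[str, Component], x: float,
            disabled: FrozenSet[str] = frozenset()) -> float:
    """Run the model with the named components removed (ablated)."""
    return sum(f(x) for name, f in components.items() if name not in disabled)


def mse(components: Dict[str, Component], data: List[Tuple[float, float]],
        disabled: FrozenSet[str] = frozenset()) -> float:
    """Mean squared error of the (possibly ablated) model on the data."""
    return sum((predict(components, x, disabled) - y) ** 2 for x, y in data) / len(data)


def ablation_study(components: Dict[str, Component],
                   data: List[Tuple[float, float]]) -> Dict[str, float]:
    """Return, per component, how much the error grows when it is removed."""
    baseline = mse(components, data)
    return {
        name: mse(components, data, frozenset([name])) - baseline
        for name in components
    }


if __name__ == "__main__":
    for name, delta in ablation_study(TOY_COMPONENTS, TOY_DATA).items():
        print(f"{name}: error increase {delta}")
```

On this toy data, removing `linear` raises the error by 24.0, removing `bias` by 1.0, and removing `unused` by 0.0, so `unused` is the part you would strip out.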
It was originally built as a general purpose sequence-to-sequence (seq2seq) model.
The research history leading up to this was interesting - there had been a bunch of work, in various domains, on "autoencoder" architectures used to learn compact representations for things like dimensionality reduction and sequence representation. The idea was to have an encoder-decoder pair, connected by a limited bottleneck representation, with the training goal of the decoder reconstructing the encoder input from the bottleneck representation.
One example of this was to learn a fixed size(!) sequence (e.g. sentence) representation using an LSTM-based autoencoder (LSTM->embedding->LSTM), which at the time seemed rather shocking - the ability to represent a variable length sequence with a fixed size embedding. Equally shocking was that you could use this for machine translation simply by connecting an LSTM encoder for one language to an LSTM decoder for another language.
This type of LSTM->LSTM seq2seq encode-decode architecture for machine translation was then improved by Bahdanau by replacing the fixed size representation with an attention mechanism so the decoder could learn to be more specific about input-output relationships.
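For anyone who wants to see what "attention instead of a fixed size representation" means mechanically, here is a rough Python sketch of additive (Bahdanau-style) attention for a single decoder step. The score is v . tanh(W s + U h_j) for decoder state s and each encoder state h_j. The weight values, dimensions and function names below are assumptions chosen for illustration, not taken from any real model.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def additive_attention(decoder_state: Vector, encoder_states: List[Vector],
                       W: Matrix, U: Matrix, v: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoder step."""
    Ws = matvec(W, decoder_state)
    scores = []
    for h in encoder_states:
        Uh = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is a weighted average of ALL encoder states, recomputed at
    # every decoder step, rather than one fixed vector for the whole sentence.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical "learned" weights
    weights, context = additive_attention(
        decoder_state=[1.0, 0.0],
        encoder_states=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
        W=identity, U=identity, v=[1.0, 0.0],
    )
    print(weights, context)
```

The point of the design is that the decoder gets a different mix of encoder states at each output position, so the whole input no longer has to squeeze through a single embedding.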
This type of LSTM-based seq2seq encode-decode architecture, using attention, is what Uszkoreit et al. set out to improve - to make more efficient by using a parallel vs sequential (RNN) architecture. The Transformer was never conceived of as purely for language modelling, or as an "AI" architecture. Later, when usage shifted to language modelling (generation, not translation), the encoder was dropped, since there is no separate source sequence to encode: the model simply continues its own input.
If you read the Wired article linked elsewhere on this thread, then it explains that. The work was being done by people from the Google Translate team.
Source for this? The notion of attention dates to a content-addressable lookup during sequence alignment (as well as, concurrently, memory lookups in neural Turing machines). Attention had been used in other models, like GRUs and LSTMs with attention. The Vaswani et al. paper did not introduce attention, it just removed everything _but_ attention (and FFW) from the network. Are you claiming the "critical idea" of removing the GRU and LSTM parts and just keeping attention was "truly" Noam's?
At some point in late 2017 the paper was updated with this additional detail:
Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
In any case, if the authors considered their contributions equal, that's good enough for me.
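For reference, the "scaled dot-product attention" named in that footnote is softmax(Q K^T / sqrt(d_k)) V. A bare-bones Python sketch of just that formula follows, with no masking, no multiple heads and no learned projections. The function names and the tiny inputs are my own assumptions for illustration, not code from the paper or from tensor2tensor.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries: List[Vector], keys: List[Vector],
                                 values: List[Vector]) -> List[Vector]:
    """Compute softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0]]
    # A query strongly aligned with the first key mostly returns the first value.
    print(scaled_dot_product_attention([[10.0, 0.0]], keys, values))
    # An all-zero query attends uniformly and returns the mean of the values.
    print(scaled_dot_product_attention([[0.0, 0.0]], keys, values))
```

Each query row is independent of the others, which is what makes the whole thing easy to parallelize compared to stepping an RNN through the sequence.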
Thanks - wanted to point to this, and indeed should have worded my claim more precisely. And yes, am aware of prior work on attention.
(I need to look it up, but I recall Noam saying publicly that he wouldn’t have agreed to random ordering of contributions if he knew this was going to be this big).
Nope, but it’s not particularly unknown either. It shouldn’t be a surprise; he had remarkable research contributions before and after (separately, he was also an IMO gold medalist).