| HN Mirror

Roughly:

1) The initial seq-2-seq approach was using LSTMs - one to encode the input sequence, and one to decode the output sequence. It's amazing that this worked at all - encode a variable length sentence into a fixed size vector, then decode it back into another sequence, usually of different length (e.g. translate from one language to another).

2) There are two weaknesses of this RNN/LSTM approach - the fixed size representation, and the corresponding lack of ability to determine which parts of the input sequence to use when generating specific parts of the output sequence. These deficiencies were addressed by Bahdanau et al in an architecture that combined encoder-decoder RNNs with an attention mechanism ("Bahdanau attention") that looked at each past state of the RNN, not just the final one.

3) RNNs are inefficient to train, so Jakob Uszkoreit was motivated to come up with an approach that better utilized available massively parallel hardware, and noted that language is as much hierarchical as sequential, suggesting a layered architecture where at each layer the tokens of the sub-sequence would be processed in parallel, while retaining a Bahdanau-type attention mechanism where these tokens would attend to each other ("self-attention") to predict the next layer of the hierarchy. Apparently in initial implementation the idea worked, but not better than other contemporary approaches (incl. convolution), but then another team member, Noam Shazeer, took the idea and developed it, coming up with an architecture (which I've never seen described) that worked much better, which was then experimentally ablated to remove unnecessary components, resulting in the original transformer. I'm not sure who came up with the specific key-based form of attention in this final architecture.

4) The original transformer, as described in the "attention is all you need paper", still had a separate encoder and decoder, copying earlier RNN based approaches, and this was used in some early models such as Google's BERT, but this is unnecessary for language models, and OpenAI's GPT just used the decoder component, which is what everyone uses today. With this decoder-only transformer architecture the input sentence is input into the bottom layer of the transformer, and transformed one step at a time as it passes through each subsequent layer, before emerging at the top. The input sequence has an end-of-sequence token appended to it, which is what gets transformed into the next-token (last token) of the output sequence.