| Transformer-based architectures and unsupervised pre-training are achieving state-of-the-art results across multiple modalities, including NLP, CV, speech recognition, genomics, and physics, so here's my must-read list of recent papers on these topics (along with some of my notes). Happy holidays!

[1] Attention Is All You Need (2017)
https://paperswithcode.com/paper/attention-is-all-you-need
Introduced the Transformer architecture and applied it to NLP tasks.

[2] The Annotated Transformer (2018)
https://nlp.seas.harvard.edu/2018/04/03/attention.html
An “annotated” version of [1] in the form of a line-by-line PyTorch implementation. Super helpful for learning how to implement Transformers in practice!

[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
https://paperswithcode.com/paper/bert-pre-training-of-deep-b...
One of the most highly cited papers in machine learning!
Proposed an unsupervised pre-training objective called masked language modeling; learned bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Bonus: https://nlp.stanford.edu/seminar/details/jdevlin.pdf
See the above slideshow from the primary author, noting the remarkably prescient conclusion: "With [unsupervised] pre-training, bigger == better, without clear limits (so far)".

[4] Conformer: Convolution-augmented Transformer for Speech Recognition (2020)
https://paperswithcode.com/paper/conformer-convolution-augme...
Proposed an architecture combining aspects of CNNs and Transformers; performed data augmentation in the frequency domain (spectral augmentation).

[5] Scaling Laws for Neural Language Models (2020)
https://paperswithcode.com/paper/scaling-laws-for-neural-lan...
Arguably one of the most important papers published in the last 5 years!
Studied empirical scaling laws for (Transformer) language models; performance scales as a power law with model size, dataset size, and amount of compute used for training; trends span more than seven orders of magnitude.

[6] Language Models are Few-Shot Learners (May 2020, NeurIPS 2020 Best Paper)
https://paperswithcode.com/paper/language-models-are-few-sho...
Introduced GPT-3, a Transformer model with 175 billion parameters, 10x more than any previous non-sparse language model.
Trained on Azure's AI supercomputer; training costs are rumored to be over 12 million USD.
Presented evidence that the average person cannot distinguish real news articles from GPT-3-generated ones that are ~500 words long.

[7] CvT: Introducing Convolutions to Vision Transformers (Mar 2021)
https://paperswithcode.com/paper/cvt-introducing-convolution...
Introduced the Convolutional vision Transformer (CvT), which has alternating layers of convolution and attention; used supervised pre-training on ImageNet-22k.

[8] Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition (Oct 2020)
https://paperswithcode.com/paper/pushing-the-limits-of-semi-...
Scaled up the Conformer architecture to 1B parameters; used both unsupervised pre-training and iterative self-training.
Observed through ablation analysis that unsupervised pre-training is the key to turning growth in model size into gains in model performance.

[9] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Jan 2021)
https://paperswithcode.com/paper/switch-transformers-scaling...
Introduced the Switch Transformer architecture, a sparse Mixture-of-Experts model that advances the scale of language models by pre-training models with up to 1 trillion parameters.
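For anyone wondering what "sparse Mixture of Experts" means mechanically in [9], here's a minimal pure-Python sketch of top-1 ("switch") routing: a router scores each token against every expert, and only the highest-scoring expert runs on that token. The names (`switch_layer`, `make_expert`), the toy "experts" that just scale their input, and the random weights are my own illustrative assumptions, not the paper's code; the load-balancing loss and expert capacity limits are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical stand-in for an expert feed-forward network."""
    return lambda token: [scale * x for x in token]


def switch_layer(tokens, router_weights, experts):
    """Route each token to exactly one expert (top-1 routing)."""
    outputs, chosen = [], []
    for token in tokens:
        # Router logits: one dot product per expert.
        logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
        probs = softmax(logits)
        best = max(range(len(experts)), key=lambda i: probs[i])
        # Only the selected expert runs; its output is scaled by the gate value.
        expert_out = experts[best](token)
        outputs.append([probs[best] * y for y in expert_out])
        chosen.append(best)
    return outputs, chosen


random.seed(0)
d_model, n_experts = 4, 8
experts = [make_expert(i + 1) for i in range(n_experts)]
router_weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]

outputs, chosen = switch_layer(tokens, router_weights, experts)
print(chosen)  # index of the single expert used for each token
```

Note that each token triggers exactly one expert call regardless of `n_experts`; only the (cheap) router grows with the number of experts.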
The sparsely activated model has an outrageous number of parameters but a constant computational cost.
The 1T-parameter model was distilled (shrunk) by 99% while retaining 30% of the performance benefit of the larger model. Findings were consistent with [5].

[10] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing (August 2021)
https://paperswithcode.com/paper/prottrans-towards-cracking-...
Applied Transformer-based NLP models to classify and predict properties of protein structure for a given amino acid sequence, using supercomputers at Oak Ridge National Laboratory.
Showed that unsupervised pre-training captured useful features; used the learned representations as input to small CNN/FNN models, yielding results that challenge state-of-the-art methods, notably without using multiple sequence alignment (MSA) or evolutionary information (EI) as input.
Highlighted a remarkable trend across an immense diversity of protein LMs and corpora: performance on downstream supervised tasks increased with the number of samples presented during unsupervised pre-training.

[11] CoAtNet: Marrying Convolution and Attention for All Data Sizes (December 2021)
https://paperswithcode.com/paper/coatnet-marrying-convolutio...
Current state-of-the-art top-1 accuracy on ImageNet. |