by criticaltinker 1640 days ago
Transformer-based architectures and unsupervised pre-training are achieving state-of-the-art results across multiple modalities, including NLP, CV, speech recognition, genomics, and physics, so here's my must-read list of recent papers on these topics (along with some of my notes). Happy holidays!

[1] Attention Is All You Need (2017) https://paperswithcode.com/paper/attention-is-all-you-need

Introduced the Transformer architecture and applied it to NLP tasks.
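
For anyone who hasn't seen it written out, the core operation of [1] is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. Below is a minimal pure-Python sketch of that single formula. The function names and toy inputs are my own for illustration, and it leaves out multi-head projections, masking, and everything else in the full architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[i] for wj, v in zip(w, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy inputs (made up for illustration): 2 queries, 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, so every output vector is a convex combination of the value vectors.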

[2] The Annotated Transformer (2018) https://nlp.seas.harvard.edu/2018/04/03/attention.html

An “annotated” version of [1] in the form of a line-by-line PyTorch implementation. Super helpful for learning how to implement Transformers in practice!

[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) https://paperswithcode.com/paper/bert-pre-training-of-deep-b...

One of the most highly cited papers in machine learning! Proposed an unsupervised pre-training objective called masked language modeling; learned bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
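
A rough sketch of the masked language modeling corruption step, in case it helps. It assumes the commonly described recipe from the paper (about 15% of positions selected; of those, 80% replaced with a mask token, 10% replaced with a random token, 10% left unchanged). The per-token coin flip, the `mask_tokens` name, and the toy vocabulary are my simplifications, not the authors' code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

# Toy sentence and vocabulary, purely illustrative.
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

The model sees `corrupted` with context on both sides of each selected position and is trained to predict the non-None entries of `labels`.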

Bonus: https://nlp.stanford.edu/seminar/details/jdevlin.pdf

See the above slideshow from the primary author, noting the remarkably prescient conclusion: "With [unsupervised] pre-training, bigger == better, without clear limits (so far)"

[4] Conformer: Convolution-augmented Transformer for Speech Recognition (2020) https://paperswithcode.com/paper/conformer-convolution-augme...

Proposed an architecture combining aspects of CNNs and Transformers; performed data augmentation in the frequency domain (spectral augmentation).
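
To give a feel for spectral augmentation, here is a minimal sketch that zeroes out one random band of frequency bins and one random band of time steps in a spectrogram stored as a list of rows. The function name, default widths, and toy spectrogram are assumptions of mine; the paper's actual augmentation policy and hyperparameters are not reproduced here.

```python
import random

def spec_augment(spec, max_freq_width=2, max_time_width=3, rng=None):
    """Mask one random frequency band and one random time band with zeros."""
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    f = rng.randint(0, max_freq_width)  # band width in frequency bins
    f0 = rng.randint(0, n_freq - f)     # first masked bin
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    t = rng.randint(0, max_time_width)  # band width in time steps
    t0 = rng.randint(0, n_time - t)     # first masked time step
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0
    return out

# Toy 4x6 "spectrogram" (frequency bins x time steps), values made up.
spec = [[float(i * 6 + j + 1) for j in range(6)] for i in range(4)]
augmented = spec_augment(spec, rng=random.Random(0))
```

The maximum widths must not exceed the spectrogram dimensions; a fresh random mask would normally be drawn for every training example.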

[5] Scaling Laws for Neural Language Models (2020) https://paperswithcode.com/paper/scaling-laws-for-neural-lan...

Arguably one of the most important papers published in the last 5 years! Studies empirical scaling laws for (Transformer) language models; performance scales as a power-law with model size, dataset size, and amount of compute used for training; trends span more than seven orders of magnitude.
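
A power law is a straight line on a log-log plot, which is what makes these trends easy to fit and extrapolate. The sketch below fits loss = a * N^(-alpha) by ordinary least squares on (log N, log loss). The data is synthetic and the constants are made up for illustration; they are not the values reported in the paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares in log-log space; return (a, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Synthetic data, NOT results from the paper: loss = 10 * N**-0.08.
sizes = [10 ** k for k in range(5, 10)]
losses = [10.0 * n ** -0.08 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
```

With noise-free synthetic data the fit recovers the generating constants; on real training runs you would fit measured losses and check how well the line holds across scales.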

[6] Language Models are Few-Shot Learners (May 2020, NeurIPS 2020 Best Paper) https://paperswithcode.com/paper/language-models-are-few-sho...

Introduced GPT-3, a Transformer model with 175 billion parameters, 10x more than any previous non-sparse language model. Trained on Azure's AI supercomputer, with training costs rumored to be over 12 million USD. Presented evidence that the average person cannot distinguish between real and GPT-3-generated news articles that are ~500 words long.

[7] CvT: Introducing Convolutions to Vision Transformers (Mar 2021) https://paperswithcode.com/paper/cvt-introducing-convolution...

Introduced the Convolutional vision Transformer (CvT) which has alternating layers of convolution and attention; used supervised pre-training on ImageNet-22k.

[8] Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition (Oct 2020) https://paperswithcode.com/paper/pushing-the-limits-of-semi-...

Scaled up the Conformer architecture to 1B parameters; used both unsupervised pre-training and iterative self-training. Observed through ablative analysis that unsupervised pre-training is the key to enabling growth in model size to transfer to model performance.

[9] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Jan 2021) https://paperswithcode.com/paper/switch-transformers-scaling...

Introduced the Switch Transformer architecture, a sparse Mixture of Experts model that advances the scale of language models by pre-training models with up to 1 trillion parameters. The sparsely-activated model has an outrageous number of parameters, but a constant computational cost. The 1T parameter model was distilled (shrunk) by 99% while retaining 30% of the performance benefit of the larger model. Findings were consistent with [5].
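
The "many parameters, constant compute" point comes from routing each token to a single expert. Here is a minimal sketch of that top-1 routing idea: a router scores the experts, the token goes only to the highest-scoring one, and the expert output is scaled by the router probability. The toy experts and router weights are hypothetical, and the sketch leaves out expert capacity limits and the load-balancing loss.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(tokens, router_weights, experts):
    """Route each token to its single highest-probability expert (top-1)."""
    outputs, choices = [], []
    for x in tokens:
        # One router logit per expert: dot product of a weight row with the token.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
        probs = softmax(logits)
        e = max(range(len(probs)), key=probs.__getitem__)
        # Only the chosen expert runs, so per-token compute does not grow
        # with the number of experts.
        y = experts[e](x)
        outputs.append([probs[e] * yi for yi in y])
        choices.append(e)
    return outputs, choices

# Hypothetical toy experts and router weights, for illustration only.
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[3.0, 0.0], [0.0, 3.0]]
outputs, choices = switch_layer(tokens, router_weights, experts)
```

Adding more experts adds parameters, but each token still passes through exactly one of them.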

[10] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing (August 2021) https://paperswithcode.com/paper/prottrans-towards-cracking-...

Applied Transformer-based NLP models to classify and predict properties of protein structure for a given amino acid sequence, using supercomputers at Oak Ridge National Laboratory. Showed that unsupervised pre-training captured useful features; used the learned representations as input to small CNN/FNN models, yielding results that challenge state-of-the-art methods, notably without using multiple sequence alignment (MSA) and evolutionary information (EI) as input. Highlighted a remarkable trend across an immense diversity of protein LMs and corpora: performance on downstream supervised tasks increased with the number of samples presented during unsupervised pre-training.

[11] CoAtNet: Marrying Convolution and Attention for All Data Sizes (December 2021) https://paperswithcode.com/paper/coatnet-marrying-convolutio...

Current state of the art Top-1 Accuracy on ImageNet.

2 comments

Thanks for this thoughtful list. I try not to flood my ML dev team with too much academic reading, but obviously some papers are too important to skip. Seeing another person's take on what's important helps me refine what I give to the newcomers to get them up to speed.
The ViT paper doesn't make your list?
Good suggestion, it was tough to narrow down the list! Here is a link to the ViT paper in case others are interested [1].

According to the latest ImageNet standings [2], ViT appears to have slipped to second place in Top-1 Accuracy. CoAtNet-7 is the new leader, but only by a slight margin and at the cost of what appears to be a significantly larger model.

[1] Scaling Vision Transformers https://paperswithcode.com/paper/scaling-vision-transformers

[2] https://paperswithcode.com/sota/image-classification-on-imag...