by criticaltinker 1640 days ago

Transformer-based architectures and unsupervised pre-training are achieving state-of-the-art results across multiple modalities, including NLP, CV, speech recognition, genomics, and physics, so here's my must-read list of recent papers on these topics (along with some of my notes). Happy holidays!

[1] Attention Is All You Need (2017) https://paperswithcode.com/paper/attention-is-all-you-need

Introduced the Transformer architecture and applied it to NLP tasks.
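For anyone new to the idea, the core operation in [1] is scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V. Below is a minimal pure-Python sketch of that single step, assuming plain lists of lists as inputs. The function names are my own, and it leaves out multi-head projections, masking, and batching.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """queries: n x d_k, keys: m x d_k, values: m x d_v (lists of lists)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Each output row is a weighted average of the value vectors, with weights that sum to 1 and favor the keys most similar to the query.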

[2] The Annotated Transformer (2018) https://nlp.seas.harvard.edu/2018/04/03/attention.html

An “annotated” version of [1] in the form of a line-by-line PyTorch implementation. Super helpful for learning how to implement Transformers in practice!

[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) https://paperswithcode.com/paper/bert-pre-training-of-deep-b...

One of the most highly cited papers in machine learning! Proposed an unsupervised pre-training objective called masked language modeling; learned bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
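To make the masked language modeling objective concrete, here is a rough sketch of the input corruption step. It follows the 15% selection and 80/10/10 replacement scheme described in the BERT paper, but it selects positions independently per token, which is a simplification. `mask_tokens` and its arguments are hypothetical names, not an API from any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels
```

The loss is computed only at positions where the label is not None, and the model can use context on both sides of each such position to make its prediction.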

Bonus: https://nlp.stanford.edu/seminar/details/jdevlin.pdf

See the above slideshow from the primary author, noting the remarkably prescient conclusion: "With [unsupervised] pre-training, bigger == better, without clear limits (so far)"

[4] Conformer: Convolution-augmented Transformer for Speech Recognition (2020) https://paperswithcode.com/paper/conformer-convolution-augme...

Proposed an architecture combining aspects of CNNs and Transformers; performed data augmentation in the frequency domain (spectral augmentation).

[5] Scaling Laws for Neural Language Models (2020) https://paperswithcode.com/paper/scaling-laws-for-neural-lan...

Arguably one of the most important papers published in the last 5 years! Studies empirical scaling laws for (Transformer) language models; performance scales as a power-law with model size, dataset size, and amount of compute used for training; trends span more than seven orders of magnitude.
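A power law of the form loss = a * N^(-alpha) is a straight line on a log-log plot, so the exponent can be estimated with ordinary least squares on the logs. Here is a small sketch of that fit. The data is synthetic and made up for illustration (it is not taken from the paper), and `fit_power_law` is my own helper name.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic, made-up data (NOT results from the paper): loss = 10 * N**-0.1
sizes = [10 ** k for k in range(3, 10)]
losses = [10.0 * n ** -0.1 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
```

With an exponent like the illustrative 0.1 used here, doubling N multiplies the loss by 2^(-0.1), so each doubling buys roughly a 7% reduction. That is why these trends only become visible across many orders of magnitude.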

[6] Language Models are Few-Shot Learners (May 2020, NeurIPS 2020 Best Paper) https://paperswithcode.com/paper/language-models-are-few-sho...

Introduced GPT-3, a Transformer model with 175 billion parameters, 10x more than any previous non-sparse language model. Trained on Azure's AI supercomputer, with training costs rumored to be over 12 million USD. Presented evidence that the average person cannot distinguish between real and GPT-3-generated news articles that are ~500 words long.

[7] CvT: Introducing Convolutions to Vision Transformers (Mar 2021) https://paperswithcode.com/paper/cvt-introducing-convolution...

Introduced the Convolutional vision Transformer (CvT) which has alternating layers of convolution and attention; used supervised pre-training on ImageNet-22k.

[8] Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition (Oct 2020) https://paperswithcode.com/paper/pushing-the-limits-of-semi-...

Scaled up the Conformer architecture to 1B parameters; used both unsupervised pre-training and iterative self-training. Observed through ablative analysis that unsupervised pre-training is the key to enabling growth in model size to transfer to model performance.

[9] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Jan 2021) https://paperswithcode.com/paper/switch-transformers-scaling...

Introduced the Switch Transformer architecture, a sparse Mixture of Experts model advancing the scale of language models by pre-training up to 1 trillion parameter models. The sparsely-activated model has an outrageous number of parameters, but a constant computational cost. 1T parameter model was distilled (shrunk) by 99% while retaining 30% of the performance benefit of the larger model. Findings were consistent with [5].
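The "constant computational cost" comes from top-1 routing: a small router scores every expert for each token, but only the single best-scoring expert actually runs. Here is a toy sketch of that idea, assuming a linear router and experts given as plain Python callables. The names are hypothetical, and the paper's expert capacity limits and load-balancing loss are left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def switch_layer(tokens, router_weights, experts):
    """Route each token to exactly one expert (top-1), scaled by its gate value.

    tokens: list of d-dim vectors; router_weights: one d-dim vector per expert;
    experts: list of callables mapping a vector to a vector.
    """
    outputs, chosen = [], []
    for x in tokens:
        # Router logits: one dot product per expert.
        logits = [sum(w_i * x_i for w_i, x_i in zip(w, x))
                  for w in router_weights]
        probs = softmax(logits)
        k = max(range(len(experts)), key=lambda i: probs[i])  # top-1 expert
        y = experts[k](x)  # only ONE expert runs for this token
        outputs.append([probs[k] * v for v in y])
        chosen.append(k)
    return outputs, chosen
```

Adding more experts grows the parameter count, but the per-token work stays at one expert call plus the cheap router scores.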

[10] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing (August 2021) https://paperswithcode.com/paper/prottrans-towards-cracking-...

Applied Transformer-based NLP models to classify & predict properties of protein structure for a given amino acid sequence, using supercomputers at Oak Ridge National Laboratory. Showed that unsupervised pre-training captured useful features; used the learned representations as input to small CNN/FNN models, yielding results that challenge state-of-the-art methods, notably without using multiple sequence alignment (MSA) and evolutionary information (EI) as input. Highlighted a remarkable trend across an immense diversity of protein LMs and corpora: performance on downstream supervised tasks increased with the number of samples presented during unsupervised pre-training.

[11] CoAtNet: Marrying Convolution and Attention for All Data Sizes (December 2021) https://paperswithcode.com/paper/coatnet-marrying-convolutio...

Current state of the art Top-1 Accuracy on ImageNet.

2 comments

OttPeterR 1639 days ago

Thanks for this thoughtful list. I try not to flood my ML dev team with too much academic reading, but obviously some papers are too important to skip. Seeing another person's take on what’s important helps me refine what I give to the newcomers to get them up to speed.

kettleballroll 1640 days ago

The ViT paper doesn't make your list?

criticaltinker 1640 days ago

Good suggestion, it was tough to narrow down the list! Here is a link to the ViT paper in case others are interested [1].

According to the latest ImageNet standings [2], ViT appears to have slipped to second place in Top-1 Accuracy. CoAtNet-7 is the new leader, but only by a slight margin and at the cost of what appears to be a significantly larger model.

[1] Scaling Vision Transformers https://paperswithcode.com/paper/scaling-vision-transformers

[2] https://paperswithcode.com/sota/image-classification-on-imag...

kettleballroll 1640 days ago

That isn't the ViT paper; this one is: https://paperswithcode.com/paper/an-image-is-worth-16x16-wor...
