Hacker News new | ask | show | jobs
by craffel 2302 days ago
We include MASS in our empirical survey (see e.g. section 3.3.2 of our paper, https://arxiv.org/pdf/1910.10683.pdf). FWIW, people were pre-training Transformers before MASS, e.g. "Improving Language Understanding by Generative Pre-Training" by Radford et al. from 2018. Even further back, "Semi-Supervised Sequence Learning" by Dai et al. describe pre-training an RNN encoder-decoder model for subsequent transfer.
1 comments

But Radford is just pretraining the decoder and qualitatively different from a seq2seq approach such as MASS. If we just look at the original paper from Vaswani, than "pretraining a transformer" imho should always only have meant pretraing the encoder and decoder. Obviously that ship has sailed.