| HN Mirror



	by craffel 2349 days ago
	We include MASS in our empirical survey (see e.g. section 3.3.2 of our paper, https://arxiv.org/pdf/1910.10683.pdf). FWIW, people were pre-training Transformers before MASS, e.g. "Improving Language Understanding by Generative Pre-Training" by Radford et al. from 2018. Even further back, "Semi-Supervised Sequence Learning" by Dai et al. describes pre-training an RNN encoder-decoder model for subsequent transfer.

1 comment

kitsune_ 2349 days ago

But Radford et al. only pretrain the decoder, which is qualitatively different from a seq2seq approach such as MASS. If we just look at the original paper from Vaswani et al., then "pretraining a transformer" imho should always have meant pretraining both the encoder and the decoder. Obviously that ship has sailed.
