>The training procedure of MT-DNN consists of two stages: pretraining and multi-task fine-tuning. The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence pre- diction.3
and this:
>Our implementation of MT-DNN is based on the PyTorch implementation of BERT4. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Fol- lowing (Liu et al., 2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradi- ent norm within 1. All the texts were tokenized using wordpieces, and were chopped to spans no longer than 512 tokens.
>We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
and this:
>Our implementation of MT-DNN is based on the PyTorch implementation of BERT4. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Fol- lowing (Liu et al., 2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradi- ent norm within 1. All the texts were tokenized using wordpieces, and were chopped to spans no longer than 512 tokens.
You won't be able to train BERT in 3 epochs.