	by phowon 2679 days ago
	It's a relatively small modification of BERT, with multi-task fine-tuning and slightly different output heads. It should be easy for any NLP researcher to replicate.

riku_iki 2679 days ago

Except you need significant GPU/TPU resources to pretrain the language model.

bitL 2679 days ago

You can't even train BERT_large on a 12 GB or 16 GB GPU, and on a single 15 TFLOPS GPU it might take a year to train. GPUs are too slow :-(
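As a rough sanity check on the "a year" figure, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: roughly 340M parameters for BERT_large, the common rule of thumb of about 6 FLOPs per parameter per training token, full 512-token sequences at every step, and the batch size and step count reported in the BERT paper.

```python
# Back-of-the-envelope pretraining time on one GPU.
# All constants below are assumptions for illustration.
PARAMS = 340e6                 # approximate BERT_large parameter count
TOKENS_PER_STEP = 256 * 512    # batch size x sequence length from the BERT paper
STEPS = 1_000_000              # training steps from the BERT paper
PEAK_FLOPS = 15e12             # the "15 TFLOPS" GPU from the comment


def training_days(utilization: float) -> float:
    """Days of training at a given fraction of peak throughput."""
    # Rule of thumb: about 6 FLOPs per parameter per token (forward + backward).
    total_flops = 6 * PARAMS * TOKENS_PER_STEP * STEPS
    seconds = total_flops / (PEAK_FLOPS * utilization)
    return seconds / 86_400


for u in (1.0, 0.5):
    print(f"utilization {u:.0%}: {training_days(u):.0f} days")
```

Under these assumptions the estimate is a bit over 200 days at peak throughput and over 400 days at 50% utilization, so "about a year" is the right order of magnitude.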

riku_iki 2679 days ago

TPUs are also slow; they used a pod with 64 TPUs for training BERT. You can probably achieve a similar result using distributed training on multiple GPU machines.

solomatov 2679 days ago

You can, but it will be really slow: load just part of the model onto the GPU at a time, and keep the rest on disk or in host memory :-)

bitL 2678 days ago

Technically correct ;-)

solomatov 2678 days ago

It actually isn't that bad. TensorFlow and PyTorch have support for it, but the performance penalty will be quite large.

solomatov 2679 days ago

The authors of the paper didn't pretrain the language model. They used an existing BERT and fine-tuned it in a novel way.

riku_iki 2679 days ago

Could you provide a citation? I tried to find this but couldn't.

solomatov 2679 days ago

>The training procedure of MT-DNN consists of two stages: pretraining and multi-task fine-tuning. The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction.

and this:

>Our implementation of MT-DNN is based on the PyTorch implementation of BERT. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Following (Liu et al., 2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm within 1. All the texts were tokenized using wordpieces, and were chopped to spans no longer than 512 tokens.
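For readers unfamiliar with the schedule and clipping described in that quote, here is a minimal pure-Python sketch. It is one common reading of "linear decay with warm-up over 0.1" and "clip the gradient norm within 1", not the authors' code; the function names `lr_at` and `clip_by_global_norm` are hypothetical, and the exact form of the decay in the actual implementation may differ.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 5e-5, warmup: float = 0.1) -> float:
    """Linear warm-up to base_lr over the first `warmup` fraction of steps,
    then linear decay to zero (assumed form of the schedule)."""
    progress = step / total_steps
    if progress < warmup:
        return base_lr * progress / warmup
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup))


def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


# The learning rate ramps up, peaks at 10% of training, then decays.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))

# A gradient with norm 5 is scaled down to norm ~1; a small one is left alone.
print(clip_by_global_norm([3.0, 4.0]))
print(clip_by_global_norm([0.1, 0.2]))
```

In PyTorch the same ideas would normally be expressed with a built-in optimizer, a learning rate scheduler, and a gradient-clipping utility rather than hand-written helpers like these.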

You won't be able to train BERT in 3 epochs.

solomatov 2679 days ago

Here's the quote from the BERT paper:

>We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
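The arithmetic in that quote can be checked in a few lines. This sketch uses only the numbers in the quote and assumes, as the quote appears to, that one corpus word counts as one token; the variable names are just for illustration.

```python
SEQS_PER_BATCH = 256
TOKENS_PER_SEQ = 512
STEPS = 1_000_000
CORPUS_WORDS = 3.3e9

tokens_per_batch = SEQS_PER_BATCH * TOKENS_PER_SEQ
total_tokens = tokens_per_batch * STEPS
epochs = total_tokens / CORPUS_WORDS  # treats one word as one token

print(tokens_per_batch)   # 131072
print(round(epochs, 1))   # 39.7
```

So the "128,000 tokens/batch" in the quote is a rounded figure (the exact product is 131,072), and the "approximately 40 epochs" follows from it. That is far more training than the 5 epochs reported for MT-DNN.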

danielcampos93 2678 days ago

Can confirm from conversations I had with the authors.
