| HN Mirror



	by bitL 2685 days ago
	You can't even train BERT_large on a 12/16 GB GPU, and on a single 15 TFLOPS GPU it might take a year to train. GPUs are too slow :-(
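
To put a rough number on the memory side of this claim, here is a back-of-envelope sketch. It assumes the roughly 340M parameter count reported for BERT_large, fp32 storage everywhere, and an Adam-style optimizer that keeps two extra buffers per parameter. The function names are made up for illustration, and activations and framework overhead are not counted.

```python
# Back-of-envelope training-memory estimate (weights, gradients, Adam state only).
# Assumptions: fp32 everywhere, Adam keeps two moment buffers per parameter,
# activations and framework overhead are NOT counted.

BERT_LARGE_PARAMS = 340_000_000  # approximate parameter count reported for BERT_large
BYTES_PER_FLOAT32 = 4


def static_training_bytes(num_params: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Bytes for weights + gradients + Adam first and second moments."""
    copies = 4  # weights, gradients, Adam m, Adam v
    return num_params * bytes_per_value * copies


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


static_gb = to_gb(static_training_bytes(BERT_LARGE_PARAMS))
for gpu_gb in (12, 16):
    left = gpu_gb - static_gb
    print(f"{gpu_gb} GB GPU: {static_gb:.2f} GB static, {left:.2f} GB left for activations")
```

Under these assumptions the static state alone is about 5.4 GB, so what remains on a 12 GB card has to hold every activation for the batch, which is plausibly where training runs out of room.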

2 comments

riku_iki 2685 days ago

TPUs are also slow: they used a pod with 64 TPUs to train BERT. You can probably achieve a similar result with distributed training across multiple GPU machines.
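
The scaling argument here can be sketched as simple arithmetic. This is only an estimate under loudly stated assumptions that are not from the thread: the 4 days of pod time is the figure reported in the BERT paper, the 45 TFLOPS peak per TPU chip is a hypothetical placeholder, and scaling is treated as linear in peak throughput. The function names are illustrative.

```python
# Rough wall-clock scaling between accelerators, assuming ideal (linear) scaling.
# Assumptions (not from the thread): 4 days on the 64-chip pod, as reported in the
# BERT paper, and a hypothetical 45 TFLOPS peak per TPU chip.


def single_device_days(pod_chips: int, pod_days: float,
                       chip_tflops: float, target_tflops: float) -> float:
    """Days one target device would need to do the pod's total work."""
    total_tflops_days = pod_chips * pod_days * chip_tflops
    return total_tflops_days / target_tflops


def cluster_days(one_device_days: float, num_devices: int, efficiency: float = 1.0) -> float:
    """Days on num_devices devices at a given parallel efficiency (0 < efficiency <= 1)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return one_device_days / (num_devices * efficiency)


one_gpu = single_device_days(pod_chips=64, pod_days=4, chip_tflops=45, target_tflops=15)
print(f"one 15 TFLOPS GPU: {one_gpu:.0f} days (~{one_gpu / 365:.1f} years)")
print(f"64 such GPUs at 80% efficiency: {cluster_days(one_gpu, 64, 0.8):.1f} days")
```

With these placeholder numbers a single GPU lands in the range of a year or more, and a 64-GPU cluster comes back down to a couple of weeks. Real utilization and communication overhead would shift both figures.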


solomatov 2685 days ago

You can, but it will be really slow. You can load just parts of the model at a time and store the rest on disk or in host memory :-)
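
A toy sketch of the idea, assuming a model that is just a stack of small weight matrices: each layer is stored in its own file and only one is resident in memory at a time. This is plain Python for illustration, with hypothetical helper names. It is not how any framework implements offloading, and it covers only the forward pass.

```python
# Toy illustration of layer-by-layer offloading: every layer lives on disk and
# only one is resident in memory at a time. Pure Python, no framework.
import pickle
import tempfile
from pathlib import Path


def save_layers(layers, directory: Path) -> list:
    """Write each layer (a weight matrix as a list of rows) to its own file."""
    paths = []
    for i, weights in enumerate(layers):
        path = directory / f"layer_{i}.pkl"
        path.write_bytes(pickle.dumps(weights))
        paths.append(path)
    return paths


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def offloaded_forward(paths, x):
    """Run the forward pass, loading one layer from disk per step."""
    for path in paths:
        weights = pickle.loads(path.read_bytes())  # load only this layer
        x = [max(0.0, v) for v in matvec(weights, x)]  # linear layer + ReLU
        del weights  # drop it before loading the next layer
    return x


with tempfile.TemporaryDirectory() as tmp:
    layers = [
        [[1.0, 0.0], [0.0, 2.0]],
        [[1.0, 1.0], [-1.0, 0.0]],
    ]
    paths = save_layers(layers, Path(tmp))
    result = offloaded_forward(paths, [3.0, 4.0])
    print(result)
```

The disk read on every layer of every pass is where the "really slow" part comes from. Training would also have to stream gradients and optimizer state the same way.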


bitL 2684 days ago

Technically correct ;-)


solomatov 2684 days ago

It actually isn't that bad. TensorFlow and PyTorch have support for it, but the penalty will be quite large.
