Hacker News
by bitL 2685 days ago
You can't even train BERT_large on a 12/16 GB GPU, and on a single 15 TFLOPS GPU it might take a year to train. GPUs are too slow :-(
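A rough sanity check of the "might take a year" figure, as a sketch under stated assumptions rather than a measurement. The parameter count, step count and batch size are the values reported in the BERT paper as far as I recall them. The 6 FLOPs per parameter per token rule of thumb, the fixed sequence length of 512 and the utilization values are assumptions made for this estimate.

```python
# Back-of-envelope pre-training time for BERT_large on one GPU.
# Every input below is an assumption, not a measurement.

PARAMS = 340e6             # BERT_large parameter count (as reported in the BERT paper)
STEPS = 1_000_000          # pre-training steps (as reported in the BERT paper)
BATCH_SEQS = 256           # sequences per batch (as reported in the BERT paper)
SEQ_LEN = 512              # assumed for every step; the paper uses 128 for most steps
PEAK_FLOPS = 15e12         # the 15 TFLOPS GPU mentioned in the comment
FLOPS_PER_PARAM_TOKEN = 6  # rule of thumb covering forward and backward pass


def training_days(utilization: float) -> float:
    """Estimated days of training at a given fraction of peak throughput."""
    tokens = STEPS * BATCH_SEQS * SEQ_LEN
    total_flops = FLOPS_PER_PARAM_TOKEN * PARAMS * tokens
    seconds = total_flops / (PEAK_FLOPS * utilization)
    return seconds / 86_400


if __name__ == "__main__":
    for u in (1.0, 0.5, 0.3):
        print(f"utilization {u:.0%}: {training_days(u):.0f} days")
```

At full peak throughput this comes out at a bit over 200 days, so a year is plausible once realistic utilization is taken into account. Using the shorter sequence length for most steps would lower the estimate.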
2 comments

TPUs are also slow; they used a pod with 64 TPUs to train BERT. You can probably achieve a similar result using distributed training on multiple GPU machines.
You can, but it will be really slow. You can load just parts of the model onto the GPU at a time, and keep the rest on disk or in host memory :-)
Technically correct ;-)
It actually isn't that bad. TensorFlow and PyTorch both have support for it, but the performance penalty will be quite large.
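To put numbers on why a 12/16 GB card is tight, here is a minimal estimate of the memory taken by model state alone. It assumes fp32 training with Adam and the roughly 340M parameter count of BERT_large. Activations are not counted: they depend on batch size and sequence length, and they are the part that loading the model piecewise or recomputing layers trades against speed. The thread does not say which mechanism is meant, so this only illustrates the budget.

```python
# Rough GPU memory for model state when training with Adam in fp32.
# Activations are deliberately left out; they depend on batch size and sequence length.

PARAMS = 340e6    # BERT_large parameter count (assumed)
BYTES_FP32 = 4    # bytes per fp32 value


def model_state_gb(params: float, bytes_per_value: int = BYTES_FP32) -> float:
    """Weights + gradients + two Adam moment buffers, in GB (10^9 bytes)."""
    weights = params * bytes_per_value
    grads = params * bytes_per_value
    adam_moments = 2 * params * bytes_per_value  # first and second moment estimates
    return (weights + grads + adam_moments) / 1e9


if __name__ == "__main__":
    state = model_state_gb(PARAMS)
    for gpu_gb in (12, 16):
        left = gpu_gb - state
        print(f"{gpu_gb} GB GPU: {state:.2f} GB model state, {left:.2f} GB left for activations")
```

Under these assumptions about 5.4 GB is gone before a single activation is stored, which is why fitting a useful batch means keeping part of the state off the GPU or recomputing activations, both of which cost time.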