Hacker News
by
bitL
2685 days ago
You can't even train BERT_large on a 12/16 GB GPU, and on a single 15 TFLOPS GPU it might take a year to train. GPUs are too slow :-(
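For a rough sense of scale, here is a back-of-the-envelope sketch of both numbers. It assumes the 340M parameter count reported for BERT_large, fp32 training with an Adam-style optimizer (weights, gradients, and two moment buffers per parameter), and treats the 15 TFLOPS figure as a peak rate. The `utilization` parameter and the function names are made up for illustration, and activation memory, which usually dominates, is not counted.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
PEAK_FLOPS = 15e12  # the "15 TFLOPS" GPU from the comment, taken as a peak rate


def flops_budget(years, utilization=1.0):
    """Total floating-point operations a single GPU delivers in `years`.

    `utilization` is an assumed fraction of peak actually achieved.
    """
    return PEAK_FLOPS * utilization * years * SECONDS_PER_YEAR


PARAMS = 340e6  # BERT_large parameter count as reported in the BERT paper
# Assumption: fp32 weights (4 B) + fp32 gradients (4 B) + two Adam moments (8 B).
BYTES_PER_PARAM = 4 + 4 + 8


def training_state_gb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Memory for weights, gradients, and optimizer state only, in GB (1e9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"one GPU-year at peak: {flops_budget(1):.3e} FLOPs")
    print(f"weights + grads + Adam state: {training_state_gb():.2f} GB")
```

Under these assumptions the fixed training state alone takes several GB, so whether a 12/16 GB card is enough depends mostly on batch size and sequence length, which drive activation memory.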
2 comments
riku_iki
2685 days ago
TPUs are also slow; they used a pod with 64 TPUs to train BERT. You can probably achieve a similar result using distributed training across multiple GPU machines.
solomatov
2685 days ago
You can, but it will be really slow. You can load just parts of the model onto the GPU at a time and keep the rest on disk or in host memory :-)
bitL
2684 days ago
Technically correct ;-)
solomatov
2684 days ago
It actually isn't that bad. TensorFlow and PyTorch both have support for it, but the performance penalty will be quite large.
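To make the idea concrete, here is a minimal PyTorch sketch of keeping the model in host memory and moving one block at a time to the GPU. This is a hand-rolled illustration, not the built-in support mentioned above: the tiny `blocks` stack and the `offloaded_forward` helper are hypothetical, it covers inference only (training would also have to shuttle gradients and optimizer state), and it falls back to the CPU when no GPU is present.

```python
import torch
from torch import nn

# Hypothetical stand-in for a large model: a stack of blocks held in host (CPU) memory.
blocks = [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def offloaded_forward(blocks, x, device):
    """Run the blocks in order with only one block resident on `device` at a time."""
    x = x.to(device)
    with torch.no_grad():  # inference only in this sketch
        for block in blocks:
            block.to(device)  # copy this block's weights to the device
            x = block(x)
            block.to("cpu")   # move them back to host memory to free device memory
    return x.cpu()


out = offloaded_forward(blocks, torch.randn(8, 64), device)
```

Every block is copied to the device and back on each forward pass, which is where the large penalty comes from: the transfers can easily cost more than the compute itself.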