by
igul222
3879 days ago
This is also not quite right. Models whose parameters are too big to fit on one GPU can be trained by splitting them across multiple GPUs, as was done here, for example:
http://papers.nips.cc/paper/5346-sequence-to-sequence-learni...
brianchu
3879 days ago
According to the paper, the parameters fit on one GPU (or at least a single GPU was able to train the model). Training was just too slow, so they trained on 8 GPUs in parallel. Those GPUs were still on the same machine, though (one node, multiple GPUs).
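To make the "fits on one GPU" versus "has to be split" distinction concrete, here is a minimal sketch in plain Python. It is not the paper's method: `assign_layers`, the layer sizes, and the capacity numbers are all hypothetical, and it only illustrates placing consecutive layers on GPUs under a per-GPU memory budget.

```python
def assign_layers(layer_sizes, gpu_capacity):
    """Greedily place consecutive layers on GPUs without exceeding capacity.

    layer_sizes and gpu_capacity use the same arbitrary memory unit.
    Returns one list of layer indices per GPU used.
    """
    placement = [[]]
    used = 0
    for i, size in enumerate(layer_sizes):
        if size > gpu_capacity:
            raise ValueError(f"layer {i} alone exceeds one GPU's capacity")
        if used + size > gpu_capacity:
            # The current GPU is full: start filling the next one.
            placement.append([])
            used = 0
        placement[-1].append(i)
        used += size
    return placement


# Hypothetical 4-layer model, 3 units of memory per layer.
layers = [3, 3, 3, 3]

# Large enough GPU: everything fits on one device, so extra GPUs
# would only buy speed, not feasibility.
fits_on_one = assign_layers(layers, gpu_capacity=12)

# Smaller GPU: the model has to be split across two devices.
needs_split = assign_layers(layers, gpu_capacity=6)
```

In the first case more GPUs are a speed optimization; in the second, splitting is required just to hold the parameters at all.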