


	by modeless 3880 days ago
	This is wrong. Training data can be streamed through GPU memory during training. It's your parameters that can't exceed GPU memory. You can get GPUs with 12 GB of memory, and they also support float16, so a parameter can take half the space it would as a float32 value on a CPU. If your model has more parameters than that, then you'll be waiting months or years for a single model to train using CPUs, even distributed. Furthermore, almost any technique you use to distribute and scale training will work just as well whether the computations happen on CPUs or GPUs.
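To put rough numbers on the 12 GB and float16 point, here is a back-of-the-envelope sketch. It assumes "12 GB" means 12 GiB and counts only the parameters themselves; gradients, optimizer state, and activations would shrink the real limit. The helper name `max_params` is made up for illustration.

```python
# Rough upper bound on how many parameters fit in GPU memory.
# Assumption: memory holds parameters only (no gradients, optimizer
# state, or activations), so real limits are noticeably lower.

GIB = 1024 ** 3  # bytes in one GiB


def max_params(memory_bytes: int, bytes_per_param: int) -> int:
    """Return how many parameters of the given width fit in memory."""
    return memory_bytes // bytes_per_param


gpu_memory = 12 * GIB

fp32_params = max_params(gpu_memory, 4)  # float32: 4 bytes per parameter
fp16_params = max_params(gpu_memory, 2)  # float16: 2 bytes per parameter

print(f"float32: {fp32_params:,} parameters")
print(f"float16: {fp16_params:,} parameters")
```

Under these assumptions, float16 doubles the parameter budget from about 3.2 billion to about 6.4 billion.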

3 comments

igul222 3880 days ago

This is also not quite right. Models whose parameters are too big to fit on one GPU can be trained by splitting them across multiple GPUs, as was done here, for example: http://papers.nips.cc/paper/5346-sequence-to-sequence-learni...
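One way to picture splitting a model across GPUs is to assign consecutive layers to devices so that each device stays within its parameter budget. The sketch below is a hypothetical illustration of that idea only, not the placement scheme used in the linked paper; `partition_layers` and the layer sizes are invented for the example.

```python
from typing import List


def partition_layers(layer_params: List[int], capacity: int) -> List[List[int]]:
    """Greedily assign consecutive layers (by index) to devices.

    Each device holds at most `capacity` parameters. Raises ValueError
    if a single layer is larger than one device can hold.
    """
    devices: List[List[int]] = [[]]
    used = 0
    for idx, n in enumerate(layer_params):
        if n > capacity:
            raise ValueError(f"layer {idx} does not fit on one device")
        if used + n > capacity:
            # Current device is full: start filling the next one.
            devices.append([])
            used = 0
        devices[-1].append(idx)
        used += n
    return devices


# Hypothetical per-layer parameter counts (arbitrary units).
layers = [400, 400, 300, 500, 200]
placement = partition_layers(layers, capacity=1000)
print(placement)  # [[0, 1], [2, 3, 4]]
```

In a real setup, activations also have to cross device boundaries between layer groups, which this sketch ignores.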


brianchu 3880 days ago

According to the paper, the parameters fit on one GPU (or at least that one GPU was able to train the model). It was just too slow, so they trained on 8 GPUs in parallel. But those GPUs were still on the same machine (one node, multiple GPUs).


gojomo 3880 days ago

At least back in 2012, Google research seemed to suggest that distributed CPU training of large models could sometimes be preferable to fitting within the limits of GPUs:

http://research.google.com/archive/large_deep_networks_nips2...


limau 3880 days ago

SINGA provides an abstraction for defining all known models, a neural net structure that makes model and data partitioning easy, and a parallelism model that supports synchronous, asynchronous, and hybrid training frameworks. Processing at each node can be CPU-based, GPU-based, or mixed CPU-GPU.
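For readers unfamiliar with the synchronous versus asynchronous distinction, here is a toy sketch on a one-parameter model. It does not use SINGA's API; every name in it is hypothetical. Each worker owns one data shard (a target value) and computes the gradient of 0.5 * (w - target)^2. The asynchronous variant is simulated sequentially, with every worker reading the same stale snapshot of the parameter.

```python
from typing import List


def grad(w: float, target: float) -> float:
    """Gradient of 0.5 * (w - target)**2 with respect to w."""
    return w - target


def sync_step(w: float, shards: List[float], lr: float) -> float:
    """Synchronous: wait for all workers, average gradients, update once."""
    grads = [grad(w, t) for t in shards]
    return w - lr * sum(grads) / len(grads)


def async_round(w: float, shards: List[float], lr: float) -> float:
    """Asynchronous (simulated): each worker computed its gradient on a
    stale snapshot, and updates are applied one by one as they arrive."""
    snapshot = w  # value every worker read before any update landed
    for t in shards:
        w -= lr * grad(snapshot, t)
    return w


shards = [1.0, 2.0, 3.0]  # one toy data shard per worker
w_sync = 0.0
w_async = 0.0
for _ in range(200):
    w_sync = sync_step(w_sync, shards, lr=0.1)
    w_async = async_round(w_async, shards, lr=0.1)

print(w_sync, w_async)  # both approach the mean of the shards
```

On this toy problem both variants settle near the same value. In practice, stale gradients can make asynchronous training noisier, in exchange for not waiting on slow workers.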
