by colincooke 2170 days ago
Should note (as someone who has a few of these systems in my lab) that unfortunately the consumer RTX cards don't do memory pooling. This means that although NVLink is good for inter-GPU comms, it doesn't actually let you run giant models that need the entire 48 GB of memory for a backward pass (i.e. treat the combined cards as "one card"). Not typically a problem for most people, but worth mentioning.
4 comments

None of the ML frameworks support memory pooling, so unless you write CUDA code yourself this point is moot.
From https://www.nvidia.com/en-us/deep-learning-ai/products/titan...:

"NVIDIA TITAN RTX NVLink Bridge

The TITAN RTX NVLink™ bridge connects two TITAN RTX cards together over a 100 GB/s interface. The result is an effective doubling of memory capacity to 48 GB, so that you can train neural networks faster, process even larger datasets, and work with some of the biggest rendering models."

Yeah you're not wrong, but it's a bit misleading. This allows you to run faster, but it does it by allowing you to use a larger batch size (arguably not best practice, but your mileage will vary). Memory pooling is a bit different in that you can treat the combined cards as a single card from TF/PyTorch.
But batch size is probably the least of your problems, since you can do data parallelism (send half the batch to each GPU, combine on the CPU).
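
To make the "send half the batch to each GPU, combine" idea concrete, here is a minimal sketch in plain Python with no GPU involved. Two simulated "devices" each compute a gradient on half the batch, and the average of the two matches the full-batch gradient. The 1-D linear model, the data, and the function names (`grad_mse`, `data_parallel_grad`) are made up for illustration and are not from any framework.

```python
# Toy data parallelism: every "device" holds a full copy of the weight,
# sees only its shard of the batch, and the shard gradients are averaged.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, num_devices=2):
    """Split the batch evenly, compute one gradient per shard, then average."""
    assert len(xs) % num_devices == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_devices
    grads = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        # In a real setup this call would run on device d.
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    # The "combine" step: average the per-device gradients.
    return sum(grads) / num_devices

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad_mse(w, xs, ys)             # one device, whole batch
split = data_parallel_grad(w, xs, ys)  # two devices, half a batch each
print(full, split)
```

Note that each device still needs a full copy of the model, so this only helps with batch size, not with a model that doesn't fit on one card.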

I think a model bigger than GPU memory is the only case where you really wish for NVLink on V100s.

Memory pooling is irrelevant for DL training. 24 GB is enough to run a batch size of 1 for BERT-Large, so honestly this is a good choice. Some folks are saying that 2x 2080 Tis would have been better, and that's true if you're doing convnets, but for any large-scale language model fine-tuning you'll want to have at least 24 GB of VRAM.
You contradict yourself. Memory pooling is precisely what would allow you to train your BERT-Large on two 2080 Tis.
No, my comment says that the two 2080 Tis would be better for convnets / situations where you don’t need to train BERT-Large. If you’re sure about memory pooling working for DL, please share code and examples, we would love to see one.
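
For a rough sense of the numbers, here is a back-of-the-envelope sketch of the fixed memory cost of training, under some loud assumptions: roughly 340M parameters for BERT-Large, everything in FP32, and an Adam-style optimizer that keeps two extra buffers per parameter. Activations are deliberately left out, and they are what actually grows with batch size and sequence length. The helper name `training_state_gb` is made up for this example.

```python
def training_state_gb(num_params, bytes_per_value=4):
    """Weights + gradients + two Adam moment buffers, in GB (1e9 bytes).

    Assumes FP32 everywhere and ignores activations and framework overhead.
    """
    copies = 4  # weights, gradients, Adam first moment, Adam second moment
    return num_params * bytes_per_value * copies / 1e9

BERT_LARGE_PARAMS = 340_000_000  # approximate published parameter count

state_gb = training_state_gb(BERT_LARGE_PARAMS)

# Compare against an 11 GB card (2080 Ti) and a 24 GB card (TITAN RTX).
for budget_gb in (11, 24):
    headroom_gb = budget_gb - state_gb
    print(f"{budget_gb} GB card: {state_gb:.2f} GB of state, "
          f"{headroom_gb:.2f} GB left for activations")
```

Whatever is left after the fixed state is what the activations have to fit into, which is where the smaller card gets tight first.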
Yeah, Quadros ... the cocaine of the ML world.
I think the nice Volta cards (V100) do it "properly". But they're out of reach for most small-scale setups (academic labs, prosumers, independent researchers, etc.).

Unfortunately the best case for high-mem use-cases is to just rent from GCP.