by paol 2171 days ago
It's worth noting that if your ML work is entirely CUDA based (as often happens), you likely won't benefit from a Threadripper CPU. Downgrading to a Ryzen 9 or even 7 will reduce costs by a good bit. The savings can be pocketed or put toward a second Titan RTX + NVLink (48 GB usable VRAM).
7 comments

Should note (from someone who has a few of these systems at my lab) that unfortunately the consumer RTX cards don't do memory pooling. This means that although NVLink is good for inter-GPU comms, it doesn't let you treat the combined cards as "one card" and run giant models that need the entire 48 GB of memory for a backwards pass. Not typically a problem for most people, but worth mentioning.
None of the ML frameworks support memory pooling, so unless you write CUDA code yourself this point is moot.
From https://www.nvidia.com/en-us/deep-learning-ai/products/titan...:

"NVIDIA TITAN RTX NVLink Bridge

The TITAN RTX NVLink™ bridge connects two TITAN RTX cards together over a 100 GB/s interface. The result is an effective doubling of memory capacity to 48 GB, so that you can train neural networks faster, process even larger datasets, and work with some of the biggest rendering models."

Yeah, you're not wrong, but it's a bit misleading. This allows you to run faster, but it does it by allowing you to use a larger batch size (arguably not best practice, but your mileage will vary). Memory pooling is a bit different in that you can treat the combined cards as a single card from TF/PyTorch.
But batch size is probably the least of the problems, since you can do data parallelism (send half the batch to each GPU, combine on the CPU).

I think a model bigger than GPU memory is the only case where you really wish for NVLink on V100s.
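To make the "half batch to each GPU, then combine" idea concrete, here is a minimal sketch in plain Python. It is only a simulation: there are no real GPUs involved, the toy model (a one-parameter linear fit with mean-squared-error loss) and the function names `grad_mse` and `data_parallel_grad` are made up for illustration, and real frameworks do the same thing with their own APIs.

```python
# Simulated data parallelism for a 1-D linear model y = w * x with
# mean-squared-error loss. Plain Python stands in for the two GPUs.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one shard."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_devices=2):
    """Split the batch evenly, compute per-device gradients, then average."""
    if len(xs) % num_devices != 0:
        raise ValueError("batch size must divide evenly across devices")
    shard = len(xs) // num_devices
    grads = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))  # one "GPU" each
    return sum(grads) / num_devices  # the "combine" step


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad_mse(w, xs, ys)               # whole batch on one device
split = data_parallel_grad(w, xs, ys)    # half the batch per device
print(full, split)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why batch size alone is not the hard part. Note that every device still holds a full copy of the model, so this does nothing for a model that does not fit on one card.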

Memory pooling is irrelevant for DL training. 24 GB is enough to run a batch size of 1 for Bert-Large, so honestly this is a good choice. Some folks are saying that 2x 2080 Tis would have been better, and that's true if you're doing convnets, but for any large-scale language model fine-tuning you'll want at least 24 GB of VRAM.
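A rough back-of-the-envelope for why VRAM gets tight, as a sketch only. The parameter count (about 340 million for Bert-Large), fp32 storage, an Adam-style optimizer with two extra values per weight, and the 11 GB figure for a 2080 Ti are all assumptions not stated in this thread, and activations (which depend on batch size and sequence length and often dominate) are deliberately left out.

```python
def training_state_gb(num_params, bytes_per_value=4, optimizer_slots=2):
    """Memory for weights + gradients + optimizer state, in GB (10^9 bytes).

    Activations are NOT included; they depend on batch size and sequence length.
    """
    copies = 1 + 1 + optimizer_slots  # weights, gradients, optimizer moments
    return num_params * bytes_per_value * copies / 1e9


BERT_LARGE_PARAMS = 340_000_000  # assumption: approximate published figure

state_gb = training_state_gb(BERT_LARGE_PARAMS)
print(f"fixed training state: {state_gb:.2f} GB")
for card_gb in (11, 24):  # assumption: 11 GB for a 2080 Ti, 24 GB for a Titan RTX
    print(f"{card_gb} GB card leaves {card_gb - state_gb:.2f} GB for activations")
```

Under these assumptions roughly 5.4 GB is gone before a single activation is stored, so the headroom left for the forward and backward pass is much smaller on the 11 GB card.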
You contradict yourself. Memory pooling is precisely what would allow you to train your bert large on two 2080ti.
No, my comment says that the two 2080 Tis would be better for convnets / situations where you don’t need to train Bert-Large. If you’re sure about memory pooling working for DL, please share code and examples, we would love to see one.
Yeah, Quadros ... the cocaine of the ML world.
I think the nice Volta cards (V100) do it "properly". But they're out of reach for most small scale setups (academic labs, prosumer, independent researcher, etc.).

Unfortunately the best case for high-mem use-cases is to just rent from GCP.

> It's worth noting that if your ML work is entirely CUDA based (as often happens), you likely won't benefit from a Threadripper CPU

Perhaps for the actual ML part, yes, but a ton of work must be done first to organize and filter the data, which is where all those cores would come in handy.

Most Ryzen consumer motherboards have a limit of 128 gigs of RAM and 16-20 PCIe lanes direct to the CPU. Are 128 gigs of RAM and x8 PCIe lanes per GPU in a dual-GPU setup a bottleneck for ML workloads?

I can see the lanes not being an issue for the next gen Titans, that will likely use pcie 4.0, but that is months away.

Asking as someone outside the ML field.

In order the bottleneck is: gpu ram, cpu ram, then pci-e lanes.

There is a big delay moving data from RAM to VRAM to run a task on the GPU, so much so that if you can't fit it all on the GPU you'd be better off running the task on the CPU, unless you are very clever in how data is buffered, which isn't an option for neural networks. Because of this, the PCIe link is not saturated except when first sending the data to VRAM. PCIe 3.0 x8 runs at 7880 MB/s, so if your GPU has 16 GB of VRAM, the difference between x8 and x16 when filling it is about 1 second, when a task can typically take 8+ hours to complete.
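The arithmetic behind that "1 second" can be checked with a few lines. This is a sketch under simple assumptions: the x8 rate is the 7880 MB/s quoted above, x16 is taken as exactly double, 1 GB is treated as 1000 MB, and the whole 16 GB is copied once with no protocol overhead.

```python
X8_MB_PER_S = 7880               # PCIe 3.0 x8 figure quoted in the comment
X16_MB_PER_S = 2 * X8_MB_PER_S   # assumption: x16 is exactly double x8


def transfer_seconds(size_gb, mb_per_s):
    """Seconds to copy size_gb (taken as 1000 MB per GB) at a given rate."""
    return size_gb * 1000 / mb_per_s


vram_gb = 16
t_x8 = transfer_seconds(vram_gb, X8_MB_PER_S)
t_x16 = transfer_seconds(vram_gb, X16_MB_PER_S)
job_seconds = 8 * 3600  # the 8-hour job from the comment

print(f"x8:  {t_x8:.2f} s")
print(f"x16: {t_x16:.2f} s")
print(f"difference: {t_x8 - t_x16:.2f} s")
print(f"share of an 8-hour job: {(t_x8 - t_x16) / job_seconds:.5%}")
```

Filling the card takes about 2 seconds at x8 and about 1 second at x16, so the gap is a few thousandths of a percent of an 8-hour run. Workloads that stream data to the GPU continuously would need a different estimate.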

The reduction from 16x to 8x PCIe lanes is usually not a bottleneck for ML. Still, it's always a good idea to benchmark and validate the configuration, especially if you're planning to spend a lot of money on a bunch of identical systems.

As for RAM, only you can know how big your datasets are. But if you're training models on GPUs the bottleneck is almost certainly going to be GPU RAM, not system RAM.

Yes, that's about $1000 savings. Also, the 80+ Gold power supply is an inefficient choice given the lack of a second GPU; without the second Titan that 1000W power supply will never see 50% load. If you're over-buying power supplies for future expansion then use an efficient titanium rated supply which will waste less power at low loads. The price difference is $80.
Not really that inefficient, like a 3 - 5% difference between Gold & Titanium[1]. Additionally the Corsair RM1000x actually breaks into 80+ Platinum territory in testing[2]. Also I am skeptical of a 1KW Titanium rated PSU for under $300 that's actually in stock.

1. https://en.wikipedia.org/wiki/80_Plus#Efficiency_level_certi...

2. https://www.jonnyguru.com/blog/2015/10/25/corsair-rm1000x-10...

I thought the 80+ Certifications were about how efficient the PSU was at not converting electricity into heat, ie. loss? Perhaps I was wrong?
IIRC PC power supplies are most efficient at around 80% utilization. Below that they are not able to hit their "rated" efficiencies.
Hmm, interesting. I've always oversized my PSUs as a matter of course, since I've always thought working at 60% capacity is better than 90% or whatever.

I usually drop a 750 watt 80+ Gold into most of my builds, even though a 500 watt or even a 450 watt would be sufficient with a single GPU, and have no plans for a second GPU.

Aiming for 50-60% capacity during typical operation is the standard recommendation.

Efficiency usually starts tapering off below 50% and below 30% it falls off a cliff - however, that just means instead of an ideal 10W power consumption you're actually pulling 30W or something like that, it is usually not a big deal in absolute terms.

(There are also some exceptions: some of the platinum/titanium PSUs can actually hold pretty decent efficiencies right down into the basement.)

750W is a good "standard" recommendation, that's enough for any one GPU on the market.

The rule of thumb is really more to guide people not to buy 1600W or 2000W monster PSUs just because "bigger number is better!".

(Although those giant PSUs do have the advantage that they can often run completely passively under load, they won't kick fans on until 50% or 60% load, which for a 1600W PSU means you can comfortably run a high-end GPU and a high-end CPU completely passively.)

> Perhaps I was wrong?

No, you're not wrong. Not sure how what I wrote conflicts with that. I don't think it does.

> Also, the 80+ Gold power supply is an inefficient choice given then lack of a second GPU

That's what tripped me up, I think. Even without a second GPU, 80+ Gold or better is a good choice.

Your next sentence makes sense though, 1KW PSU's are usually overkill, even if you like to oversize your PSU like I do.

> 80+ Gold or better is a good choice.

The selection was "gold", specifically. And that's not as bad as it might be, but titanium is better across the board and much better at low load. A titanium supply is more efficient at 20% load than a gold supply at 50%, for instance.

If you're over-sizing your power supply by ~60% (as is the case here) then this is significant.
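To put numbers on the gold vs. titanium comparison, here is a small sketch. The efficiency values are the 80 Plus minimums for 115 V non-redundant supplies as I understand them from the certification table; they are not taken from this thread, real units frequently do better than the minimum, and the function names are illustrative.

```python
# Minimum efficiencies from the 80 Plus table for 115 V non-redundant supplies
# (load fraction -> efficiency). Real units often exceed these minimums.
EFFICIENCY = {
    "gold":     {0.20: 0.87, 0.50: 0.90, 1.00: 0.87},
    "titanium": {0.10: 0.90, 0.20: 0.92, 0.50: 0.94, 1.00: 0.90},
}


def wall_watts(dc_watts, efficiency):
    """AC power drawn from the wall to deliver dc_watts to the components."""
    return dc_watts / efficiency


def wasted_watts(rating, psu_watts, load_fraction):
    """Power lost as heat at one of the certified load points."""
    dc = psu_watts * load_fraction
    return wall_watts(dc, EFFICIENCY[rating][load_fraction]) - dc


psu = 1000  # the 1000 W supply discussed in the thread
for load in (0.20, 0.50):
    gold = wasted_watts("gold", psu, load)
    ti = wasted_watts("titanium", psu, load)
    print(f"{load:.0%} load: gold wastes {gold:.1f} W, titanium {ti:.1f} W")
```

On these minimums a titanium unit at 20% load (92%) does beat a gold unit at 50% load (90%), and at a 200 W draw the difference in waste is on the order of 12 W. Whether that justifies the price gap depends on hours of use and electricity cost.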

I'll keep that in mind on my next build. The pricing steps up quite radically though, it seems.

But, you build enough "rigs", you learn not to skimp on certain components like PSU's, Cases and Motherboards... which is normally where new builders cut corners.

A caveat is that if you’re going to use multiple GPUs it’s essential to get something like a Threadripper or a Xeon that has the pcie lanes to provide the full 16 lanes or at least 8 lanes to each GPU.
I’ve found that it’s really nice for things like image augmentation, and running RL environments in parallel. But maybe I should be doing augmentation in Dali.
It depends on how intensive your pre-processing pipeline is. With a really fast accelerator you can quite easily start to be bottlenecked by your CPU.
True, but Threadrippers start at 24 cores and go up from there. That's got to be some intense pre-processing. Not impossible I'm sure, but it would be unusual.
Threadripper is the only way to get more than the standard 20 PCIe lanes (and really only 16 lanes to the slots, on all but one board). It's possible that OP would have gone with a lower core count version if one existed, but the minimum buy-in on Threadripper 3000 series is the 24 core model.

tbh this is kind of one of the ideal use-cases for Epyc. With the way AMD has set up their pricing, it's no longer cheaper to use the workstation processors; in some situations it's significantly more expensive. They are really ripping you for the clock speed, and removing a bunch of other features in the process (RDIMM/LRDIMM support, etc). I strongly encourage everyone doing homelab and home ML rigs and similar stuff to really think about whether they want Threadripper, bearing in mind that Threadripper is often more expensive than Epyc. It's no longer obvious that server processors are for servers and home users can only afford workstation parts; it is the other way around.

AMD offers some low-core-count single-socket Epycs that are ideal for "lighting up the platform" tasks like this. Like, 7232P is a $450 processor and the 7402P is $1150. And they don't offer anything like that on Threadripper. They clock slower, sure, but they're not really using the CPU anyway. And that gets you a full 128 PCIe lanes, octochannel memory and RDIMM/LRDIMM support so they can stack in the memory.

If they want to game on it in their spare time then sure, Threadripper is probably the way to go.

+1. EPYC 7282/7302P/7402P is cheaper than we expect and gets massive RAM/IO capabilities. M/B is also not so expensive. Downside is that higher clock SKU of EPYC is expensive.
Threadripper Pro is an ideal processor (higher clocked Epyc). Unfortunately it's OEM only at this point.