Hacker News new | ask | show | jobs
by angoragoats 703 days ago
Unsure if anyone has specific hardware benchmarks for the 405b model yet, since it's so new, but elsewhere in this thread I outlined a build that'd probably be capable of running a quantized version of Llama 3.1 405b for roughly $10k.

The $10k figure is likely roughly the minimum amount of money/hardware that you'd need to run the model at acceptable speeds, as anything less requires you to compromise heavily on GPU cores (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or run on the CPU entirely, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.

1 comments

Energy costs are an important factor here too. While Quadro cards are much more expensive upfront (higher $/VRAM), they are cheaper over time (lower Watts/Token). Offsetting the energy expense of a 3090/4090/5090 build via solar complicates this calculation but generally speaking can be a "reasonable" way of justifying this much hardware running in a homelab.

I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.

Agree 100% that energy costs are important. The example system in my other post would consume somewhere around 300W at idle, 24/7, which is 219 kWh per month, and that's assuming you aren't using the machine at all.

I don't have any actual figures to back this up, but my gut tells me that the fact that enterprise GPUs are an order of magnitude (at least) more expensive than, say a, 3090, means that the payback period of them has got to be pretty long. I also wonder whether setting the max power on a 3090 to a lower than default value (as I suggest in my other post) has a significant effect on the average W/token.

Agreed, but there are other costs associated with supporting 10-16x GPUs that may not necessarily happen with say 6 GPUs. Having to go from single socket (or Threadripper) to dual socket, PCIE bifurcation, PLX risers, etc.

Not necessarily saying that Quadros are cheaper, just that there's more to the calculation when trying to run 405B size models at home

The system I outlined in my other post [0] has ten GPUs and does not require dual socket CPUs as far as I'm aware. It could likely scale easily to 14 GPUs as well (assuming you have sufficient power), with an x8/x8 bifurcation adapter installed in each PCIe slot. This is pushing the limits of the PCIe subsystem I'm sure, but you could also likely scale up to 28 GPUs, again assuming sufficient power, by simply bifurcating at x4/x4/x4/x4 vs x8/x8.

I think it should work as-is with the components listed, but if you disagree please let me know!

[0] https://news.ycombinator.com/item?id=41047689

I don't think this is correct. 5 years power usage of 4090 is $2600 giving TCO of ~$4300. RTX 6000 Ada starts at $6k for the card itself.

https://gpuprices.us

To be fair, you need 2x 4090 to match the VRAM capacity of an RTX 6000 Ada. There is also the rest of the system you need to factor into the cost. When running 10-16x 4090s, you may also need to upgrade your electrical wiring to support that load, you may need to spend more on air conditioning, etc.

I'm not necessarily saying that it's obviously better in terms of total cost, just that there are more factors to consider in a system of this size.

If inference is the only thing that is important to someone building this system, then used 3090s in x8 or even x4 bifurcation is probably the way to go. Things become more complicated if you want to add the ability to train/do other ML stuff, as you will really want to try to hit PCIE 4.0 x16 on every single card.

With 2x 4090 you will have 2x speed of RTX 6000 A. So same energy per token.

Will need more space, true.

Yeah, after digging more into RTX 6000 Ada cards, I don't see any way they'd be more economical even over many years, no matter how you slice it.