Hacker News
by Patrick-STH 1217 days ago
There are a few big ones:
- The CUDA license does not allow you to use GeForce in the data center. In the US this has become less common, but our Inspur AIStation piece covered a cluster located in China with GeForce cards. So it still happens, but less so.
- Memory capacity is another big challenge. Newer models have 80GB, which dwarfs the 24GB on a 4090. We just got the RTX 6000 Ada in, so that is an option for more memory.
- For higher-end training, one of the big challenges is interconnect, so having NVLink and InfiniBand or 100GbE+ NICs is important. The HGX A100 platform is designed for that with its NVSwitch and PCIe switch topology.

With all of that said, you are 100% right that many startups have used consumer cards for years. For example, Andrej Karpathy talked about how our DeepLearning11 build (8x 1080 Ti's) had a ~3 month payback period versus AWS https://twitter.com/karpathy/status/924340245478256640


> Andrej Karpathy talked about how our DeepLearning11 build (8x 1080 Ti's) had a ~3 month payback period versus AWS

In 2017. Currently you can rent an 8xA100 server for $8.80/hr: https://lambdalabs.com/service/gpu-cloud

At this price the payback stretches to about 3 years (taking into account average energy costs in the US, and assuming 24/7 operation for the whole 3 years).
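
For anyone who wants to check the arithmetic, here is a rough sketch of the rent-versus-buy calculation. Only the $8.80/hr rate and the 24/7, 3-year assumption come from the comment above; the function names and the `purchase_cost`, `power_kw`, and `energy_price_per_kwh` parameters are hypothetical placeholders, since the comment does not give a server price or power figures.

```python
HOURS_PER_YEAR = 24 * 365  # 24/7 operation, ignoring leap days


def rental_cost(rate_per_hour: float, years: float) -> float:
    """Total cost of renting continuously at a fixed hourly rate."""
    return rate_per_hour * HOURS_PER_YEAR * years


def payback_years(purchase_cost: float, power_kw: float,
                  energy_price_per_kwh: float, rate_per_hour: float) -> float:
    """Years of 24/7 use until buying costs less than renting.

    Every hour of ownership avoids the rental rate but still costs energy.
    All inputs except rate_per_hour are hypothetical; plug in your own.
    """
    hourly_saving = rate_per_hour - power_kw * energy_price_per_kwh
    if hourly_saving <= 0:
        raise ValueError("owning never pays back at these prices")
    return purchase_cost / hourly_saving / HOURS_PER_YEAR


# Rate quoted in the comment: $8.80/hr for an 8xA100 server, rented for 3 years.
three_year_rental = rental_cost(8.80, 3)
print(f"3 years of rental: ${three_year_rental:,.0f}")
```

Put differently, a 3-year payback at this rate implies that the purchase price plus 3 years of energy comes to roughly the 3-year rental total.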

A bit of extra context: that's $8.80/hr for the 40GB A100s.

The 80GB A100s will run you $12.00/hr for 8 of them.

That's roughly $24k over 3 months, which would buy and power 8 4090 GPUs by my reckoning. Of course, those 8 4090s wouldn't have enough memory to run ChatGPT, so maybe AWS is good value after all.
This is Lambda Labs' pricing, which is significantly cheaper than AWS.
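
As a quick sanity check on the 3-month figure, here is a small sketch that totals both hourly rates quoted in the thread. The 90-day, 24/7 interpretation of "3 months" is my assumption, and the names are illustrative only.

```python
HOURS_PER_DAY = 24
DAYS = 90  # assumption: "3 months" taken as 90 days of 24/7 use

# Hourly rates quoted in the thread, in USD.
RATES = {"8x A100 40GB": 8.80, "8x A100 80GB": 12.00}


def cost_over_days(rate_per_hour: float, days: int = DAYS) -> float:
    """Rental cost for continuous use over the given number of days."""
    return rate_per_hour * HOURS_PER_DAY * days


for name, rate in RATES.items():
    print(f"{name}: ${cost_over_days(rate):,.0f}")
```

Under that assumption the two totals come to about $19k and $26k, so the ~24k figure sits between them.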