Hacker News
by version_five 1180 days ago
If you have ~100k to spend, aren't there options to buy a GPU rather than just blow it all on cloud? How much is an 8xA100 machine?

4xA100 is 75k, 8 is 140k https://shop.lambdalabs.com/deep-learning/servers/hyperplane...


If you bought an 8xA100 machine for $140k you would have to run it continuously for over 10,000 hours (about 14 months) to train the 7B model. By that time the value of the A100s you bought would have depreciated substantially, especially because cloud companies will be renting/selling A100s at a discount as they bring H100s online. It might still be worth it, but it's not a home run.
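
As a rough check on the figures in this comment, here is a minimal sketch that converts 10,000 machine-hours to months and spreads the $140k purchase price over those hours. The function names are made up for illustration, and resale value, power, and every other cost are ignored.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 365.25 / 12  # average month length


def hours_to_months(hours: float) -> float:
    """Convert continuous running time in hours to months."""
    return hours / HOURS_PER_DAY / DAYS_PER_MONTH


def purchase_cost_per_hour(price_usd: float, hours: float) -> float:
    """Purchase price spread evenly over the hours the machine runs."""
    return price_usd / hours


months = hours_to_months(10_000)
rate = purchase_cost_per_hour(140_000, 10_000)
print(f"{months:.1f} months, ${rate:.2f} per machine-hour")
# prints "13.7 months, $14.00 per machine-hour"
```

So the "about 14 months" figure holds, and the purchase price alone works out to $14 per hour for the whole 8-GPU machine if it is used only for that one run and then written off.
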
If 8-bit training methods take off, I think the calculus is going to change rapidly, with newer cards that have decent amounts of memory and 8-bit acceleration starting to become dramatically more cost and time effective than the venerable A100s.
You're comparing the capital cost of acquiring a GPU machine with the operational cost of renting one in the cloud.

Ignoring the operational costs of on-prem hardware is pretty common, but those costs are significant and can greatly change the calculation.
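
One way to make the capital versus operational split concrete is a small total-cost function. This is only a sketch: the power draw, electricity price, and monthly hosting figure in the example are hypothetical placeholders, not numbers from this thread.

```python
HOURS_PER_MONTH = 730  # approximate average


def on_prem_total_usd(
    purchase_usd: float,
    months: float,
    avg_power_kw: float,
    usd_per_kwh: float,
    other_monthly_usd: float = 0.0,
) -> float:
    """Purchase price plus energy plus any other recurring costs."""
    hours = months * HOURS_PER_MONTH
    energy_usd = avg_power_kw * hours * usd_per_kwh
    return purchase_usd + energy_usd + other_monthly_usd * months


# Hypothetical inputs: 5 kW average draw, $0.15/kWh, $500/month for hosting etc.
total = on_prem_total_usd(140_000, months=14, avg_power_kw=5.0,
                          usd_per_kwh=0.15, other_monthly_usd=500.0)
print(f"on-prem total over 14 months: ${total:,.0f}")
```

The number to compare against a cloud bill is this total, not the purchase price on its own. Staff time is left out here and would push it higher.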

Heh, you work at AWS or Google Cloud perhaps? ;) (Only joking about this as I constantly see employees from AWS/GCloud and other cloud providers claim that cloud is always cheaper than hosting things yourself)

Sure, if you're planning to service a large number of users, building your infrastructure in-house might be a bit overkill, as you'll need an infrastructure team to service it as well.

If you just want to buy 4 GPUs to put in one server to run some training yourself, I don't think it's that much overkill. Especially considering you can recover much of the cost even after a year by selling much of the equipment you bought. Most of your losses will be costs for electricity and internet connection.

I used to work for Google Cloud (I built a predecessor to Preemptible VMs and also launched Google Cloud Genomics). But even before I worked at Google I was a big fan of AWS (EC2 and S3).

Buying and selling hardware isn't free; it comes with its own cost. I would not want to be in the position of selling a $100K box of computer equipment, ever.

:)

True, but some things are harder to sell than others. A100s in today's market would be easy to sell. Harder to buy, because the supply is so low unless you're Google or another big name, but if you're trying to sell them, I'm sure you can get rid of them quickly.

Cloud gives you a very good price for what it offers: excellent reliability and hyper-scalability. Most people don't need either and use it as a glorified VPS host.
The issue with on-premises is underutilization and the fact that you need more than just the hardware. You end up buying more hardware than you need and inevitably a portion of it will just sit there idling and depreciating in value. And you don't just need hardware but also investments in your building. GPUs generate a lot of heat. So, you need to get rid of that heat and make sure you beef up your power infrastructure to be able to handle the load. It's not just the GPUs that you pay for. And the equipment is expensive. So you need to invest in security as well.

Cloud pricing is pretty steep and obviously has a fat profit margin but building your own data centers isn't cheap either. Doing this at scale is not something most companies would be very good at either. Which means it probably is quite a bit more expensive relative to what the big cloud providers are doing.

Or from another perspective, comparing the cost of training one model in the cloud to the cost of training as many as you want on your machine, then (as mentioned by siblings) selling the machine for nearly as much as you paid for it, unless there's some shortage, in which case you'll get more back than you paid for it.

One is buying capital that produces models, the other is buying a single model.
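
The "capital that produces models" framing can be sketched as a per-model cost for an owned machine. Everything here is an assumption for illustration: the function name, the resale value, the operating cost, and the number of models trained are all hypothetical.

```python
def cost_per_model_owned(
    purchase_usd: float,
    resale_usd: float,
    operating_usd: float,
    models_trained: int,
) -> float:
    """Net cost of owning the machine, divided across the models it trained."""
    if models_trained < 1:
        raise ValueError("models_trained must be at least 1")
    return (purchase_usd - resale_usd + operating_usd) / models_trained


# Hypothetical: resell at half price after training 4 models, $10k spent on operations.
print(cost_per_model_owned(140_000, 70_000, 10_000, 4))  # prints 20000.0
```

The result is very sensitive to the resale value and to how many runs actually happen, which is exactly what the replies in this thread disagree about.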

For a single unit one could have it in their home or office, rather than a datacenter or colo. If the user sets up and manages the machine themselves there is no additional IT cost. The greatest operating expense would be the power cost.
"If the user sets up and manages the machine themselves there is no additional IT cost" << how much do you value your time?

In my experience, physical hardware has a management overhead over cloud resources. Backups, large disk storage for big models, etc.

For a server farm, sure; for one machine, I don't know. Assuming it plugs into a normal 15A circuit, and you have a WeWork or something where you don't pay for power, is the operational cost of one machine really material?
It's hard to tell from what you're saying: you're planning on putting an ML infrastructure training server on a regular 15A circuit, not in a data center or machine room? And power is paid for by somebody else?
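
Whether one machine fits on a 15A circuit can be estimated before buying anything. The sketch below assumes a 120 V circuit and the common rule of thumb of loading a circuit to at most 80% for continuous use; the per-GPU and rest-of-system wattages are placeholders, not measured or vendor figures.

```python
def usable_watts(
    breaker_amps: float,
    volts: float = 120.0,              # assumption: 120 V wall circuit
    continuous_fraction: float = 0.8,  # assumption: 80% rule for continuous loads
) -> float:
    """Power available for a continuous load on one circuit."""
    return breaker_amps * volts * continuous_fraction


def estimated_load_watts(
    gpu_count: int, watts_per_gpu: float, rest_of_system_watts: float
) -> float:
    """Crude estimate: GPUs plus everything else in the chassis."""
    return gpu_count * watts_per_gpu + rest_of_system_watts


budget = usable_watts(15)
# Placeholder wattages; substitute the real spec sheet values.
load = estimated_load_watts(gpu_count=4, watts_per_gpu=400.0, rest_of_system_watts=500.0)
print(f"budget {budget:.0f} W, load {load:.0f} W, fits: {load <= budget}")
```

With these placeholder numbers the load exceeds the circuit budget, so the answer depends heavily on the actual machine and the local wiring.
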

My thinking about pricing doesn't include that option because I wouldn't just hook a server like that up to a regular outlet in an office and use it for production work. If that works for you, you can happily ignore my comments. But if you go ahead and build such a thing and operate it for a year, please let us know if there were any costs, either in dollars or in suffering, associated with your decision.

[edit: adding in that the value of this machine also suggests it cannot live unattended in an insecure location, like an office]

signed, person who used to build closet clusters at universities

Nvidia happily sells what you're describing. They call it "DGX Station A100"; it has four 80GB A100s and retails for 80k. Not sure I believe their claimed noise level of <37 dB though.

Of course that's still a very small system when talking LLM training. The only reason why I would not put that in a regular office is its extreme price. Do you really want something worth 80k in a form factor that could be casually carried through the door?

If you live near an inexpensive datacenter, you can park it there. Throw in a storage machine or two (TrueNAS MINI R looks like a credible low-effort option). If your workload is to run a year-long computation on it and otherwise mostly ignore it, then your operational costs will be quite low.

Most people who rent cloud servers are not doing this type of workload.

Why couldn't a 75k machine live unattended in an office? If the same office has just a hundred employee workstations, those in total are worth much more than that. Heck, Apple offered a Mac Pro configuration that was $50k for a single workstation.
No kidding. I worked for a company that had multiple billions of dollars invested in a data center refresh in North America and Europe.
Remember to factor in the tax depreciation on the hardware and deduct any potential future gains from either reselling it or continuing to use it.
You can sell the A100s once you're done as well. Possibly even at a profit?
These are wild pieces of hardware, thanks for linking. I wonder how loud they get.