MI250s definitely aren’t a common card to rent, so I could only find Runpod, at $2.10 per hour each. This results in a training cost of $4838 plus fine-tuning of $3225. However, this doesn’t include the 11TB of storage or the time taken to get the setup actually running the tasks. So you likely wouldn’t see much change from $10k USD, if any.
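As a quick sanity check on those numbers, here is a minimal sketch that adds up the quoted compute costs and backs out the GPU-hours they imply at the $2.10 rate. The GPU count and wall-clock time aren't stated, so this only works in GPU-hours. The function and variable names are my own, and the $10k budget is just the round figure mentioned above.

```python
# Figures quoted in the comment above; nothing here is measured.
PRICE_PER_GPU_HOUR = 2.10   # USD, Runpod MI250 rate
TRAINING_COST = 4838        # USD
FINE_TUNING_COST = 3225     # USD
BUDGET = 10_000             # USD, the round number used as a ceiling


def implied_gpu_hours(cost_usd, price_per_gpu_hour=PRICE_PER_GPU_HOUR):
    """GPU-hours implied by a cost at a flat hourly rate."""
    return cost_usd / price_per_gpu_hour


compute_total = TRAINING_COST + FINE_TUNING_COST
headroom = BUDGET - compute_total  # what is left for storage and setup time

print(f"compute total: ${compute_total}")
print(f"left for storage/setup: ${headroom}")
print(f"training GPU-hours: {implied_gpu_hours(TRAINING_COST):.0f}")
print(f"fine-tuning GPU-hours: {implied_gpu_hours(FINE_TUNING_COST):.0f}")
```

Compute alone comes to $8063, which leaves a bit under $2k for the 11TB of storage and any hours burned getting the setup working.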
Runpod.io rents the next-gen MI300Xs for $4/hr, although since they also rent H100s for $3/hr (which are easier to work with and faster for training), it might be more of a novelty.
Cheaper for the cloud company, but that doesn’t always translate to cheaper for the end user. Maybe they cost more to run, or maybe there are fewer of them so they’re more expensive to book?
At least a couple of years ago, a big advantage of Nvidia cards was how much cheaper they were to run power-wise. Often the dies that made it into cloud-level cards would be binned consumer dies.
Not sure if that’s still the case, but I’d say it’s plausible.
Impossible. Power costs for H100-like cards are dwarfed by the cost of the cards themselves. An H100 at full load will consume ~$3500 (rough estimate) of power over 5 years at $0.12/kWh.
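For anyone who wants to check that rough estimate, here is a back-of-the-envelope sketch. The comment doesn't give a wattage, so the 700 W figure is my assumption (the rated TDP of the SXM version of the H100; the PCIe version is rated lower). It also assumes the card runs flat out 24/7 and ignores cooling overhead. The function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def lifetime_power_cost(watts, years, usd_per_kwh):
    """Electricity cost of a constant load, ignoring cooling overhead."""
    kwh = watts / 1000 * HOURS_PER_YEAR * years
    return kwh * usd_per_kwh


# Assumption: 700 W continuous draw (H100 SXM TDP), not stated in the comment.
cost = lifetime_power_cost(watts=700, years=5, usd_per_kwh=0.12)
print(f"5-year power cost: ${cost:,.0f}")
```

Under those assumptions it comes out to roughly $3.7k, which is in the same ballpark as the ~$3500 figure.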
Data centers are more constrained by availability of power, and the matching cooling, than the actual bulk cost of it.
For example, I've been in situations where we had to deploy fewer hard drives per server unit than we otherwise would have, just because we knew we couldn't power & cool the racks if we fully stocked them.
- https://www.runpod.io/gpu/mi250