by tedivm 1128 days ago
I love the Full Stack Deep Learning crew, and took their course several years ago in Berkeley. I highly recommend it.

One thing that always blows my mind is how much it is just not worth it to train LLMs in the cloud if you're a startup (and probably even less so for really large companies). Compared to 36-month reserved pricing, the break-even point was 8 months if you bought the hardware and rented some racks at a colo, and that includes hands-on support. Having the dedicated hardware also meant that researchers were willing to experiment more when we weren't running a planned training job, as it wouldn't pull from our budget. We spent a sizable chunk of our raise on that cluster, but it was worth every penny.

I will say that I would not put customer-facing inference on-prem at this point: the resiliency of the cloud normally offsets the pricing, and most inference can be done with cheaper hardware than training. For training, though, you can get away with a weaker SLA, and the cloud is always there if you really need to burst beyond what you've purchased.

2 comments

> Compared to 36-month reserved pricing, the break-even point was 8 months if you bought the hardware and rented some racks at a colo, and that includes hands-on support. Having the dedicated hardware also meant that researchers were willing to experiment more when we weren't running a planned training job, as it wouldn't pull from our budget.

That’s having it both ways, of course. You can’t both recoup the hardware cost in 8 months and have “free” downtime.

Under this pricing you need a duty cycle of at least 25%-ish to break even (over 3 years), so the numbers probably still favor buying, but for some people that might not add up. Pricing also varies drastically between providers, so this may depend on which one you choose.
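
For anyone who wants to check the arithmetic, here is a rough sketch using only the two numbers quoted above (8 months to break even, 36-month term). It assumes the cloud bill scales linearly with the fraction of time the cluster is actually busy, which is a simplification, and the function name is made up for illustration.

```python
def break_even_duty_cycle(break_even_months: float, horizon_months: float) -> float:
    """Fraction of the horizon the cluster must be busy for owning to match renting.

    Assumption: owning costs the same as `break_even_months` of cloud usage at
    100% utilization, and cloud spend scales linearly with utilization.
    """
    if break_even_months <= 0 or horizon_months <= 0:
        raise ValueError("month counts must be positive")
    return break_even_months / horizon_months


# Numbers from the parent comment: 8-month break-even against a 36-month term.
duty = break_even_duty_cycle(8, 36)
print(f"{duty:.0%}")  # prints 22%
```

That comes out a bit under the 25%-ish figure, which leaves some room for costs the simple ratio ignores.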

The thing is, this comparison is against a 36-month reserved instance, which also assumes 100% usage. For true on-demand you pay much more.
Reserved pricing is not on-demand.
It’s the first widespread use case in a while where building your own makes sense. Most prior workflows either needed high SLAs, were bursty, or just didn’t add up to much.