by electronbeam 615 days ago
The real money is in renting InfiniBand clusters, not individual GPUs/machines.

If you look at Lambda's one-click clusters, they list $4.49/H100/hr.


I'm in the MI300X business. This comment nails it.

In general, the $2 GPUs are either PE/venture-funded providers losing money, long contracts, huge quantities, PCIe, slow (<400G) networking, or some other limitation, like unreliable uptime from some bitcoin miner that decided to pivot into the GPU space and has zero experience running these more complicated systems.

Basically, if you decide to build on these sorts of providers and risk your business on them, you "get what you pay for".

> slow (<400G) networking

We're not getting Folding@Home style distributed training any time soon, are we?

Distributed training-data creation and curation is more useful and feasible. Training gets cheaper 1.5x every year, but data is just as expensive, if not more so, given that the era of "free web crawls of human knowledge" is over.
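
For a rough sense of how that 1.5x/year figure compounds, here is a minimal sketch. It assumes the claim means the cost of a fixed training run divides by 1.5 each year; the function name and the normalized starting cost of 1.0 are hypothetical, not from any provider's pricing.

```python
def training_cost_after(years, initial_cost=1.0, yearly_factor=1.5):
    """Cost of the same training run after `years`, assuming it gets
    `yearly_factor` times cheaper every year (assumption from the thread)."""
    return initial_cost / yearly_factor ** years


if __name__ == "__main__":
    # Relative cost of a run that costs 1.0 today, over the next five years.
    for year in range(6):
        print(f"year {year}: {training_cost_after(year):.3f}")
```

If data costs stay flat over the same period, data's share of the total budget grows every year, which is the point being made above.
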
I agree with you, but as the article mentioned, if you need to finetune a small/medium model you really don't need clusters. Getting a whole server with 8/16x H100s is more than enough. I also agree with the article when it states that most companies are finetuning some version of llama/open-weights models today.
Exactly, the article covers the segmentation that is happening by GPU cluster size.

If it's big enough for foundation model training from scratch: ~$3+. Otherwise the price drops hard.

The problem is that "big enough" is a moving goalpost now: what was big becomes small.

So why not buy up all the little H100s and put enough of them together for a cluster? Seems like a decent rollup strategy?

Of course it would still cost a lot to do... but if the difference is $2/hr vs $4.49/hr, then there's some size where it makes sense.
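
To put rough numbers on that, here is a minimal break-even sketch. The $2 and $4.49 hourly rates are the ones quoted in this thread; the per-GPU retrofit cost (networking, integration) and the utilization are made-up placeholders purely for illustration, and the function name is hypothetical.

```python
def breakeven_hours(standalone_rate, cluster_rate, retrofit_cost_per_gpu,
                    utilization=1.0):
    """Wall-clock hours until the higher cluster rate has paid back the
    one-off per-GPU cost of turning standalone GPUs into a cluster."""
    # Extra revenue per GPU per wall-clock hour, scaled by rented fraction.
    premium = (cluster_rate - standalone_rate) * utilization
    if premium <= 0:
        raise ValueError("effective hourly premium must be positive")
    return retrofit_cost_per_gpu / premium


if __name__ == "__main__":
    STANDALONE_RATE = 2.00          # $/GPU/hr, from the thread
    CLUSTER_RATE = 4.49             # $/GPU/hr, from the thread
    RETROFIT_COST_PER_GPU = 5000.0  # hypothetical assumption
    UTILIZATION = 0.7               # hypothetical assumption

    hours = breakeven_hours(STANDALONE_RATE, CLUSTER_RATE,
                            RETROFIT_COST_PER_GPU, UTILIZATION)
    print(f"break-even after {hours:.0f} hours (~{hours / 24:.0f} days)")
```

This ignores the purchase price of the GPUs themselves, power, and the risk that the clustered rate falls before the retrofit is paid back.
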

Only if they're networked with Infiniband.
Makes sense, though only folks like runpod / sfcompute / etc. have enough visibility to maybe pull this off?

It's a riskier move than just taxing the excess compute now and printing money on the margins from bag holders.

Correct me if I'm wrong, but if I recall, neither of those two companies owns its own compute. They are marketplaces.
Yup, but they at least know where all these "small unused clusters" are.

Bag holders do not want to be shouting to the world that they are bag holders.

I think sfcompute does own a lot or most of the current compute on their platform? Not entirely sure though.