

	by whack 1056 days ago
	> Rather than each of K startups individually buying clusters of N gpus, together we buy a cluster with NK gpus... Then we set up a job scheduler to allocate compute

	In theory, this sounds almost identical to the business model behind AWS, Azure, and other cloud providers. "Instead of everyone buying a fixed amount of hardware for individual use, we'll buy a massive pool of hardware that people can time-share." Outside of cloud providers having to mark up prices to give themselves a net-margin, is there something else they are failing to do, hence creating the need for these projects?

3 comments

tikkun 1055 days ago

Couple things, mostly pricing and availability:

1) Margins. Public cloud investors expect a certain margin profile. They can’t compete with Lambda/Fluidstack’s margins.

2) Networking. To an extent, the big clouds also have worse networking for LLM training. I believe only Azure has InfiniBand. Oracle is 3200 Gbps but not InfiniBand, and I believe the same goes for AWS. I'm not sure about GCP, but I believe their A100 networking speeds were only 100 Gbps rather than 1600. Lambda, FluidStack and CoreWeave, on the other hand, all have InfiniBand.

3) Availability. Nvidia isn’t giving big clouds the allocation they want.

bravura 1055 days ago

What is your differentiator from Lambda? That you are smaller and in a single DC?

Sincere question.

tikkun 1055 days ago

I'm not OP/submitter, but the main differentiator is that Lambda doesn't have on-demand availability for lots of interlinked H100s - you have to reserve them.

Lambda has "Lambda Sprint" which is kinda similar,[1] but Sprint is $4.85/GPU/hr instead of <$2.

So if you want 128 GPUs for a week, you can't use Lambda reserved (3-year term) and you can't use Lambda on-demand (you can't get 128 A/H100s on-demand). Your options are Lambda Sprint or SF Compute, and SF Compute is offering significantly lower prices.

[1]: https://lambdalabs.com/service/gpu-cloud/reserved
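As a rough back-of-the-envelope check on the numbers above, here is a small sketch that prices 128 GPUs for one week at the quoted Sprint rate and at $2/GPU/hr. The $2 figure is only an upper bound, since the comment says "<$2", and the helper name `rental_cost` is made up for illustration. It also assumes flat hourly billing with no minimums or discounts, which neither provider is confirmed to use.

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours


def rental_cost(gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    """Total cost of renting `gpus` GPUs for `hours` at a flat hourly rate."""
    return gpus * hours * rate_per_gpu_hour


# Rates taken from the comment: $4.85/GPU/hr for Sprint, "<$2" for SF Compute.
sprint = rental_cost(128, HOURS_PER_WEEK, 4.85)
sfc_upper_bound = rental_cost(128, HOURS_PER_WEEK, 2.00)

print(f"Lambda Sprint:     ${sprint:,.2f}")
print(f"At $2.00/GPU/hr:   ${sfc_upper_bound:,.2f}")
print(f"Gap (lower bound): ${sprint - sfc_upper_bound:,.2f}")
```

Under those assumptions the week comes to roughly $104k at the Sprint rate versus at most about $43k at $2/GPU/hr.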

TylerE 1055 days ago

Low margins and “will this thing still be around in 2 years” are negatively correlated.

Where’s the capital for upgrades, repairs, and replacements coming from?

littlestymaar 1055 days ago

Using investors' money to build something with low to zero margin until you capture enough value to make it profitable a few years down the line has been the core SV strategy for more than a decade now, so it's not an extraordinary plan.

Of course it doesn't always work, and it may be even harder to make it work in the current macroeconomic environment, but it's still a pretty standard play.

aabhay 1055 days ago

They are working on this. All the major clouds have initiatives to do short-term requests/reservations. It’s just not a feature that has ever been of much use pre-GenAI. How often do you need to request 1000 CPU nodes for 48 hours in a single zone?

Secondly, there is a fundamental question of resource sharing here. Even with this project by Evan and AI Grant (the second such cluster created by AI Grant, btw), the question will arise: if one team has enough money to provision the entire cluster forever, why not do it? What are the exact parameters of fair use? In networking, we have algorithms around bandwidth sharing (TCP fairness, etc.) that encode sharing mechanisms, but they don’t work for these kinds of chunky workloads either.

But over the next few months, AWS and others are working to release queueing services that let you temporarily provision a chunk of compute, probably with upfront payment and at a high price (perhaps above the on-demand rate).
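To make the fair-use question concrete, here is a minimal sketch of max-min fairness, the classic sharing rule from networking, applied to whole GPUs. This is purely an illustration and not how SF Compute, AI Grant, or any cloud actually allocates capacity. The function name and the team demands are hypothetical, and the sketch ignores exactly the things that make these workloads "chunky": job duration, minimum useful cluster size, and interconnect topology.

```python
def max_min_fair_share(capacity: int, demands: dict[str, int]) -> dict[str, int]:
    """Split `capacity` GPUs across teams by max-min fairness (integer water-filling).

    Each round, the remaining GPUs are divided equally among teams whose
    demand is not yet met; a team never receives more than it asked for.
    """
    allocation = {team: 0 for team in demands}
    remaining = capacity
    unsatisfied = {team for team, demand in demands.items() if demand > 0}

    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer GPUs left than teams: give one each, in name order.
            for team in sorted(unsatisfied)[:remaining]:
                allocation[team] += 1
            break
        for team in list(unsatisfied):
            grant = min(share, demands[team] - allocation[team])
            allocation[team] += grant
            remaining -= grant
            if allocation[team] == demands[team]:
                unsatisfied.discard(team)

    return allocation


# Hypothetical demands: one team would happily take the whole cluster.
print(max_min_fair_share(128, {"big": 1000, "small": 8}))
```

With these made-up demands, the small team gets its full 8 GPUs and the large team gets the remaining 120, so money alone cannot crowd the small team out. Whether 8 GPUs for an unbounded time is a "fair" outcome is the open question the comment points at.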

whimsicalism 1055 days ago

> It’s just not a feature that has ever been of much use pre-GenAI. How often do you need to request 1000 CPU nodes for 48 hours in a single zone?

I would argue this has always been a common case for cloud GPU compute.

beachy 1056 days ago

AWS and Azure would slit their own throats before they created a way for their customers to pool instances to save money.

They want to do that themselves, and keep the customer relationship and the profits, instead of giving them to a middleman or the customer.

jiggawatts 1055 days ago

It’s just corporate profits combined with market forces, not some sort of malicious conspiracy.

You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour. That’s just barely above the cost of the electricity and cooling!

What do you want, free compute just handed to you out of the goodness of their hearts?

There is incredible demand for high-end GPUs right now, and market prices reflect that.

beachy 1055 days ago

You mentioned malicious conspiracy, not me.

It's just business and I'd do the same if I was in charge of AWS.

alex_lav 1055 days ago

> You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour.

Source required

jiggawatts 1054 days ago

https://news.ycombinator.com/item?id=36950422

mikeravkine 1055 days ago

Sorry, where exactly are these 50c many-core servers you speak of?

jiggawatts 1054 days ago

Azure's HB120rs_v3 size is about 36c per hour right now with Spot pricing in East US. These use 3rd generation AMD EPYC "Milan" processors.

The instances with the 4th generation "Genoa-X" processors (HB176rs_v4) cost about $2.88 per hour. The HX176rs_v4 model with 1.7 TB of memory is $3.46 per hour.

https://learn.microsoft.com/en-us/azure/virtual-machines/hbv...

https://learn.microsoft.com/en-us/azure/virtual-machines/hbv...

https://learn.microsoft.com/en-us/azure/virtual-machines/hx-...

alex_lav 1054 days ago

Are these actually attainable, as in I can log in and launch an instance with these specifications right now, or are they just listings? I ask because literally last week I was unable to launch similar instances on AWS despite those specs being listed as available and online.

jiggawatts 1051 days ago

I could. Availability tends to be region-dependent with all clouds.

megakwood 1055 days ago

Where can you get 120 cores for $2/hr?

jiggawatts 1054 days ago

https://news.ycombinator.com/item?id=36950422

asdfaoeu 1055 days ago

AWS and Azure both charge by the hour anyway, so pooling wouldn't save money. But if you wanted to, you could use Reserved Instances and just have the accounts in the same organisation.

A large part of the profit comes from the upfront risk of buying machines. With this you are just absorbing that risk, which may be better if the startup expects to last.