| HN Mirror

	by execveat 1140 days ago
	Nobody in their right mind is using GCE for training. Take a look at real prices: https://vast.ai/

6 comments

simonw 1140 days ago

I got the impression that kind of thing (buying time on GPUs hosted in people's homes) isn't useful for training large models, because model training requires extremely high bandwidth connections between the GPUs such that you effectively need them in the same rack.
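
A rough back-of-the-envelope sketch of why the interconnect matters, assuming plain data-parallel training with a ring all-reduce of the gradients on every step. The function name and all the numbers (1B parameters, fp16 gradients, 8 GPUs, the link speeds) are illustrative assumptions, not measurements, and the model ignores latency and any overlap of communication with compute:

```python
def ring_allreduce_seconds(param_count, bytes_per_param, num_gpus, link_bytes_per_sec):
    """Bandwidth-only estimate of one gradient all-reduce over a ring.

    In a ring all-reduce each GPU sends (and receives) roughly
    2 * (N - 1) / N times the gradient payload. Latency is ignored.
    """
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to synchronize
    payload_bytes = param_count * bytes_per_param
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_sec


if __name__ == "__main__":
    # Hypothetical link speeds in bytes per second (assumed for illustration).
    links = {
        "100 Mbit/s home uplink": 12.5e6,
        "100 Gbit/s datacenter network": 12.5e9,
        "300 GB/s in-node GPU link": 300e9,
    }
    for name, speed in links.items():
        seconds = ring_allreduce_seconds(1_000_000_000, 2, 8, speed)
        print(f"{name}: {seconds:.4f} s per gradient sync")
```

Under these assumptions a single sync over a home connection takes minutes, while the same sync inside one machine takes a small fraction of a second, which is the gap the comment is pointing at.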

p1esk 1140 days ago

I suspect most A100s on vast.ai are actually in a datacenter, and might even be on other public clouds, such as AWS. I don't see why either vast.ai or AWS would care if that were the case.

marshray 1140 days ago

Is there a good resource that describes the impact of bandwidth and latency between GPUs?

I assume that it's completely impractical to train on distributed systems?

qeternity 1140 days ago

Anyone training this size of model is almost certainly using AWS/GCE.

The GPU marketplaces are nice for people who need smaller/single GPU setups, don't have huge reliability or SLA concerns, and where data privacy risks aren't an issue.

mrtranscendence 1140 days ago

Well, or Azure.

qeternity 1140 days ago

Ha yes of course. But actually has anyone been able to get instances on Azure? Thought OpenAI had them all reserved.

superpope99 1140 days ago

Aren't they explicitly using TPUs in their training? Vast AI are only offering GPUs.

bravura 1140 days ago

These nodes typically have slow downstream, and thus are hard to use when training requires pulling a huge dataset.
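
To put a number on the slow-downstream point, here is a minimal sketch that estimates how long a node would spend just downloading the dataset. The helper name, the 1 TB dataset size, and the link speeds are assumptions chosen for illustration; it uses decimal units (1 GB = 8000 Mbit) and ignores protocol overhead:

```python
def download_hours(dataset_gb, link_mbit_per_sec):
    """Hours needed to pull a dataset over a link, bandwidth only."""
    if link_mbit_per_sec <= 0:
        raise ValueError("link speed must be positive")
    megabits = dataset_gb * 8000  # 1 GB = 8000 Mbit in decimal units
    return megabits / link_mbit_per_sec / 3600


if __name__ == "__main__":
    # Hypothetical downstream speeds in Mbit/s (assumed for illustration).
    for mbit in (100, 1_000, 10_000):
        print(f"1 TB at {mbit} Mbit/s: {download_hours(1000, mbit):.1f} h")
```

Since marketplace nodes are usually billed by the hour, that download time is paid for before any training starts.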