by SethTro 1212 days ago

(1022362 + 82432) gpu-hours / 2048gpus / 5 months ~= 15% uptime.

That's only about 0.07 nines of availability!
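A quick way to check that arithmetic, as a rough sketch only. The GPU-hour and GPU-count figures are the ones quoted above; the 30-day month is my assumption, since the comment does not say how the five months are counted. "Nines" here means -log10(1 - availability), so 90% is one nine and 99% is two.

```python
import math

# Figures quoted in the comment above.
gpu_hours = 1_022_362 + 82_432
num_gpus = 2048

# Assumption: five 30-day months of wall-clock time.
hours_in_window = 5 * 30 * 24

# Fraction of the available GPU time that was actually used.
utilization = gpu_hours / (num_gpus * hours_in_window)

# "Nines" of availability: 0.9 -> 1 nine, 0.99 -> 2 nines, and so on.
nines = -math.log10(1 - utilization)

print(f"utilization: {utilization:.1%}")
print(f"nines: {nines:.2f}")
```

Under that assumption the result lands right around 15% utilization and a little under a tenth of a nine. A different month length shifts the numbers slightly but not the conclusion.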

I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512 GPU) cluster running. This was probably 4x the machines and 4x the number of cluster dropouts.

2 comments

Tepix 1210 days ago

They may have thrown away some models that didn't turn out great.

foobiekr 1212 days ago

Poor GPU utilization, even when the hardware is available, is the rule. Truly amazing. Staging of data is probably a huge part of it.

pavelstoev 1212 days ago

At CentML, we profiled GPU utilization on a larger AI/ML research institute cluster. It ranged from 10% to 45%, mostly around 10%. We then offered them software optimizers (which do not affect model accuracy) to get to 90% GPU utilization.

foobiekr 1211 days ago

90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large data sets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.

mirker 1212 days ago

Is it failures or is this some backfill/budget scheduling while everyone is sleeping?

foobiekr 1211 days ago

A lot of it appears to be non-streaming approaches to data distribution, resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than what you'd want in order to hide the latency of data moves.
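To make the contrast concrete, here is a minimal sketch of the two shapes. Everything in it is hypothetical: `stage` and `process` are stand-ins for a data copy and a GPU step, and `prefetch_depth` is a made-up knob. It is not anyone's actual pipeline, just an illustration of overlapping staging with compute via a bounded queue.

```python
import queue
import threading


def stage(batch_id):
    """Hypothetical stand-in for copying one batch of data close to the GPU."""
    return [batch_id * 10 + i for i in range(3)]


def process(batch):
    """Hypothetical stand-in for the compute step on one staged batch."""
    return sum(batch)


def run_batch_style(num_batches):
    # Stage, process, clear: compute sits idle during every stage() call.
    results = []
    for batch_id in range(num_batches):
        batch = stage(batch_id)
        results.append(process(batch))
    return results


def run_streaming(num_batches, prefetch_depth=2):
    # A background thread stages upcoming batches while the main thread
    # processes the current one; the bounded queue caps how far ahead it runs.
    staged = queue.Queue(maxsize=prefetch_depth)
    sentinel = object()

    def producer():
        for batch_id in range(num_batches):
            staged.put(stage(batch_id))  # blocks while the buffer is full
        staged.put(sentinel)  # signal that no more batches are coming

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = staged.get()
        if batch is sentinel:
            break
        results.append(process(batch))
    return results


print(run_batch_style(4) == run_streaming(4))
```

Both versions produce the same results in the same order; the difference is only in when the staging happens. In Python the overlap only pays off when staging is I/O-bound, since that is when the staging thread releases the interpreter lock.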
