


	by loeg 740 days ago
	This is really unique to the AI training clusters for reasons I'm not super clear on. Most other types of horizontally scaled workloads can sort of tolerate a slightly underperforming host, or hosts going bad every so often, with little P99/P99.9 impact. For some reason, AI training workloads really cannot.

2 comments

chaos_emergent 739 days ago

When training a large model you're doing a single forward pass + backprop over multiple InfiniBand-connected nodes for a single model instance, so if one node goes down it takes a logical unit of nodes down with it. For reference, GPT-4 was rumored to be around 1.7T parameters, and doing some back-of-the-envelope math[1], that's like 500-700 H100 GPUs per model instance, which means you need a multiple of that for any training parallelism whatsoever.

[1] back-of-the-envelope math: 1.7T params * 4 bytes = 6.8 TB; 3-4x that for activations + gradients = 20.4-27.2 TB; taking the upper end, 27.2 TB / (80 GB per H100) = 340 H100s; 1.5-2x conservative multiplier accounting for not fully using node resources + memory overhead in the machine = ~500-700 H100s.

truly insane numbers.
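
For anyone who wants to poke at the footnote's arithmetic, here is a minimal sketch of it in Python. The function name and parameter names are made up for illustration, and the defaults simply mirror the rumored/assumed figures in the comment above (1.7T parameters, 4 bytes per parameter, a 4x multiplier for activations + gradients, 80 GB per H100); none of them are confirmed numbers.

```python
def h100s_per_model_instance(
    params=1.7e12,            # rumored parameter count (assumption from the thread)
    bytes_per_param=4,        # fp32 weights
    state_multiplier=4.0,     # upper end of the 3-4x for activations + gradients
    gpu_memory_bytes=80e9,    # 80 GB per H100
    overhead=(1.5, 2.0),      # conservative multiplier range from the footnote
):
    """Return (bare GPU count, (low, high) GPU count after overhead)."""
    weights_bytes = params * bytes_per_param
    total_bytes = weights_bytes * state_multiplier
    bare_gpus = total_bytes / gpu_memory_bytes
    low, high = (bare_gpus * m for m in overhead)
    return bare_gpus, (low, high)


if __name__ == "__main__":
    bare, (low, high) = h100s_per_model_instance()
    print(f"bare: {bare:.0f} H100s, with overhead: {low:.0f}-{high:.0f} H100s")
```

Dropping `state_multiplier` to 3.0 or `bytes_per_param` to 2 (e.g. bf16 weights) moves the estimate a lot, which is a good reminder of how rough this kind of math is.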


furyofantares 739 days ago

That trillion+ parameter count is the sum of each of the "experts", right?


oersted 739 days ago

The ever-circulating rumour is 1.7T - 1.8T for the whole thing. But it is not very substantiated, mostly started by SemiAnalysis and geohot based on rather loose speculation (such as API latency and price), and not much solid evidence to confirm it after that.

And of course, it must have changed substantially with GPT-4-Turbo and GPT-4o. It would make sense if the cost reduction was larger than the price reduction; they probably have a higher profit margin now, and the price reduction since the GPT-4 release has been very significant.


eugenhotaj 740 days ago

This is because everyone is training with synchronous SGD. All GPUs need to synchronize on each gradient step, so tail latency will kill you.
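
A toy model of why that hurts, as a rough sketch rather than anything measured: assume each worker is independently slow on a given step with probability `p_slow` (a made-up parameter, as are the function names and the 5x `slowdown`). With a synchronous barrier the step takes as long as the slowest worker, so the chance that a step is slow grows quickly with the number of workers.

```python
import random


def sync_step_time(worker_times):
    """A synchronous step finishes only when the slowest worker does."""
    return max(worker_times)


def prob_step_hits_straggler(n_workers, p_slow):
    """Chance that at least one of n independent workers is slow on a step."""
    return 1.0 - (1.0 - p_slow) ** n_workers


def simulate_mean_step_time(n_workers, p_slow, slowdown=5.0, steps=2000, seed=0):
    """Average synchronous step time when a normal worker takes 1.0 time unit
    and a slow worker takes `slowdown` units (both numbers are illustrative)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        times = [slowdown if rng.random() < p_slow else 1.0
                 for _ in range(n_workers)]
        total += sync_step_time(times)
    return total / steps


if __name__ == "__main__":
    for n in (1, 8, 100, 1000):
        print(n, prob_step_hits_straggler(n, 0.001),
              simulate_mean_step_time(n, 0.001))
```

Under these assumptions, a 0.1% per-worker hiccup rate that would be invisible on a single host hits roughly 63% of steps at 1,000 workers, whereas a workload that serves independent requests would only see it in the tail.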


Mehdi2277 739 days ago

I’ve worked at companies with async training. Async training does help with fault tolerance and can also help training throughput by being less reliant on the slowest machine. It does add meaningful training noise, though: when we ran experiments against sync training we got much more stable results with sync, and some of our less stable models would sometimes have loss explosions/divergence issues with async training but be fine with sync training.

Even for async training, though, I generally see the dataset just sharded, and if a worker goes down then its shard of data may be lost/skipped, rather than some kind of smarter dynamic file assignment that accounts for workers going down. Even basic things like "job fails, continue from last checkpoint with the same dataset state" are messy for a large epoch when major libraries like TensorFlow lack a good dataset checkpointing mechanism.
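
To make "dataset state" concrete, here is a minimal pure-Python sketch of a reader whose position can be saved next to a model checkpoint and restored after a job failure. `ReaderState` and `ResumableShardReader` are hypothetical names invented for this example, not part of TensorFlow or any other library, and shards are plain in-memory lists to keep it self-contained.

```python
from dataclasses import dataclass, asdict


@dataclass
class ReaderState:
    shard_index: int = 0   # which shard we are currently reading
    offset: int = 0        # next record to read within that shard


class ResumableShardReader:
    """Iterates over shards in order and can snapshot/restore its position."""

    def __init__(self, shards, state=None):
        self.shards = shards
        self.state = state or ReaderState()

    def __iter__(self):
        return self

    def __next__(self):
        s = self.state
        while s.shard_index < len(self.shards):
            shard = self.shards[s.shard_index]
            if s.offset < len(shard):
                item = shard[s.offset]
                s.offset += 1
                return item
            # Current shard exhausted: move on to the next one.
            s.shard_index += 1
            s.offset = 0
        raise StopIteration

    def state_dict(self):
        """Snapshot to store alongside the model checkpoint."""
        return asdict(self.state)

    @classmethod
    def from_state_dict(cls, shards, saved):
        """Rebuild a reader that continues from a saved snapshot."""
        return cls(shards, ReaderState(**saved))


if __name__ == "__main__":
    shards = [[1, 2, 3], [4, 5]]
    reader = ResumableShardReader(shards)
    next(reader), next(reader)          # consume two records
    saved = reader.state_dict()         # "checkpoint"
    resumed = ResumableShardReader.from_state_dict(shards, saved)
    print(list(resumed))                # continues after the checkpoint
```

A real pipeline also has to capture shuffle order, prefetch buffers, and per-worker shard assignment, which is where it gets messy; this sketch only covers the sequential case.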
