by joaquincabezas 740 days ago
Even if this scale is massive and second-to-none, it’s funny how some issues are the same for all of us. In particular “Bad hosts are very bad” aka “a chain is only as strong as its weakest link” can happen with as little as a few (4) machines, and then ruin your day.
1 comments

This is really unique to the AI training clusters for reasons I'm not super clear on. Most other types of horizontally scaled workloads can sort of tolerate a slightly underperforming host, or hosts going bad every so often, with little P99/P99.9 impact. For some reason, AI training workloads really cannot.
When training a large model you're doing a single forward pass + backprop over multiple InfiniBand-connected nodes for a single model instance, so if one node goes down it takes a logical unit of nodes down with it. For reference, GPT-4 was rumored to be around 1.7T parameters, and doing some back-of-the-envelope math[1], that's like 500-700 H100 GPUs per model instance, which means you need a multiple of that for any training parallelism whatsoever.

[1] back-of-the-envelope math: 1.7T params * 4 bytes = 6.8 TB; 3-4x that for activations + gradients = up to 27.2 TB; 27.2 TB / (80 GB per H100) = 340 H100s; a conservative 1.5-2x multiplier for not fully using node resources + memory overhead in the machine = ~500-700 H100s.
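
For anyone who wants to poke at the estimate, here is a rough sketch of the same arithmetic in Python. The function name and parameter names are made up for illustration, the 1.7T figure is the unconfirmed rumor from above, and the multipliers are just the guesses in the footnote, so treat the output as a ballpark only.

```python
def h100s_per_model_instance(
    params=1.7e12,        # rumored parameter count (unconfirmed)
    bytes_per_param=4,    # assumes fp32 weights, as in the footnote
    state_multiplier=4,   # weights -> weights + activations + gradients (upper end of 3-4x)
    gpu_mem_bytes=80e9,   # 80 GB of memory per H100
    slack=(1.5, 2.0),     # guessed multiplier for unused node resources and overhead
):
    """Return (weights_tb, total_tb, base_gpus, low_gpus, high_gpus)."""
    weight_bytes = params * bytes_per_param
    total_bytes = weight_bytes * state_multiplier
    base_gpus = total_bytes / gpu_mem_bytes
    low, high = (base_gpus * s for s in slack)
    return weight_bytes / 1e12, total_bytes / 1e12, base_gpus, low, high


weights_tb, total_tb, base, low, high = h100s_per_model_instance()
print(f"weights: {weights_tb:.1f} TB, training state: {total_tb:.1f} TB")
print(f"minimum GPUs: {base:.0f}, with slack: {low:.0f}-{high:.0f}")
```

With the defaults this lands at 340 GPUs before slack and 510-680 after, which is where the ~500-700 range comes from. Switching `bytes_per_param` to 2 (16-bit weights) halves everything, so the result is very sensitive to the assumptions.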

Truly insane numbers.

That trillion+ parameter count is the sum of each of the "experts", right?
The ever-circulating rumour is 1.7T - 1.8T for the whole thing. But it is not well substantiated: it mostly originated with SemiAnalysis and geohot, based on rather loose speculation (such as API latency and price), and not much solid evidence has confirmed it since.

And of course, it must have changed substantially with GPT-4-Turbo and GPT-4o. It would make sense if the cost reduction was larger than the price reduction: they probably have a higher profit margin now, and the price reduction since the GPT-4 release has been very significant.

This is because everyone is training with synchronous SGD. All GPUs need to synchronize on each gradient step, so tail latency will kill you.
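
A toy simulation makes the tail-latency point concrete. This is only a sketch under made-up assumptions: per-worker compute time is drawn uniformly from 0.95-1.05 time units, a "bad host" is simply 2x slower, and communication cost is ignored. The function names are hypothetical.

```python
import random


def sync_step_time(worker_times):
    # In synchronous SGD every worker waits at the gradient sync,
    # so the step takes as long as the slowest worker.
    return max(worker_times)


def mean_step_time(num_workers, steps, slow_workers=0, slow_factor=2.0, seed=0):
    """Average synchronous step time over `steps` simulated steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        times = [rng.uniform(0.95, 1.05) for _ in range(num_workers)]
        for i in range(slow_workers):
            times[i] *= slow_factor  # a degraded host is slow on every step
        total += sync_step_time(times)
    return total / steps


healthy = mean_step_time(num_workers=1024, steps=50)
degraded = mean_step_time(num_workers=1024, steps=50, slow_workers=1)
print(f"healthy: {healthy:.2f}  one bad host out of 1024: {degraded:.2f}")
```

One slow host out of 1024 roughly doubles the step time for the whole job in this model, whereas a request-serving fleet would only see that host's share of traffic slow down.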
I’ve worked at companies with async training. Async training does help with fault tolerance and can also improve training throughput by being less reliant on the slowest machine. It does add meaningful training noise, though. When we ran experiments against sync training, we got much more stable results with sync training, and some of our less stable models would sometimes have loss explosions/divergence issues with async training but be fine with sync training.

Although even for async training, I generally see the dataset just sharded, and if a worker goes down then its shard of data may be lost/skipped, rather than some kind of smarter dynamic file assignment that accounts for workers going down. Even basic things, like having a failed job continue from the last checkpoint with the same dataset state in a large epoch, are messy when major libraries like TensorFlow lack a good dataset checkpointing mechanism.
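
To illustrate what "same dataset state" means here, a minimal sketch in plain Python: a reader over shards that can save and restore its position alongside the model checkpoint. The class and its `state_dict` / `load_state_dict` methods are hypothetical names for illustration, not any library's API, and shards are in-memory lists instead of files.

```python
class ResumableShardReader:
    """Iterates over shards in order and can resume from a saved position."""

    def __init__(self, shards):
        self.shards = [list(s) for s in shards]
        self.shard_idx = 0  # which shard we are currently reading
        self.offset = 0     # next record to read within that shard

    def __iter__(self):
        return self

    def __next__(self):
        while self.shard_idx < len(self.shards):
            shard = self.shards[self.shard_idx]
            if self.offset < len(shard):
                record = shard[self.offset]
                self.offset += 1
                return record
            # Current shard is exhausted, move on to the next one.
            self.shard_idx += 1
            self.offset = 0
        raise StopIteration

    def state_dict(self):
        # Save this next to the model checkpoint.
        return {"shard_idx": self.shard_idx, "offset": self.offset}

    def load_state_dict(self, state):
        self.shard_idx = state["shard_idx"]
        self.offset = state["offset"]


reader = ResumableShardReader([[0, 1, 2], [3, 4], [5]])
seen = [next(reader) for _ in range(4)]  # train on 4 records, then "crash"
saved = reader.state_dict()

restarted = ResumableShardReader([[0, 1, 2], [3, 4], [5]])
restarted.load_state_dict(saved)
remaining = list(restarted)              # picks up without repeats or gaps
```

The hard parts in practice (shuffling, prefetch buffers, multiple workers each holding part of the state) are exactly what this sketch leaves out.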