by avital 1173 days ago
This isn't accurate. The bottleneck in very-large-scale training BY FAR is communication between devices. If you have a million CPUs, the communication cost will be significantly higher than with a thousand A100s (perhaps on the order of 100x or even more). So this is only possible to replicate with very dense, high-compute chips connected by an extremely fast interconnect.
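For anyone who wants to plug in their own numbers, here is a rough back-of-envelope sketch of the point above. It uses the textbook ring all-reduce cost model, which is an assumption on my part: the comment does not say which collective or topology is in play. The function name and every parameter value below are hypothetical placeholders, not measurements.

```python
def ring_allreduce_seconds(num_devices, payload_bytes, link_bytes_per_s, hop_latency_s):
    """Estimate the time for one ring all-reduce of `payload_bytes`.

    Textbook model only: 2 * (N - 1) steps (reduce-scatter, then all-gather),
    each moving a chunk of payload / N bytes over one link and paying one
    per-hop latency. Ignores compute/communication overlap, hierarchical
    topologies, and network congestion.
    """
    if num_devices < 2:
        return 0.0  # nothing to exchange with a single device
    steps = 2 * (num_devices - 1)
    chunk_bytes = payload_bytes / num_devices
    return steps * (hop_latency_s + chunk_bytes / link_bytes_per_s)


if __name__ == "__main__":
    # All values below are illustrative assumptions, not benchmarks.
    gradient_bytes = 1e9  # hypothetical payload synced per training step
    few_fast = ring_allreduce_seconds(1_000, gradient_bytes, 100e9, 5e-6)
    many_slow = ring_allreduce_seconds(1_000_000, gradient_bytes, 1e9, 50e-6)
    print(f"1,000 fast devices:     {few_fast:.4f} s per all-reduce")
    print(f"1,000,000 slow devices: {many_slow:.4f} s per all-reduce")
```

In this model the bandwidth term barely changes with device count, but the latency term grows linearly with it, so a very large number of slow, loosely connected devices pays far more per synchronization step than a small number of dense chips on a fast interconnect. Real systems use tree or hierarchical collectives that change the constants, so treat this as a way to reason about the trend, not as a prediction.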
1 comment

Thanks for providing this insight. Is A100 the only platform? Can we pause/resume all such platforms simultaneously?