HN Mirror



	by avital 1220 days ago
	This isn't accurate. The bottleneck in very-large-scale training is, BY FAR, communication between devices. With a million CPUs, the communication cost would be significantly higher than with a thousand A100s (perhaps on the order of 100x or even more). So this can only be replicated with very dense, high-compute chips connected by an extremely fast interconnect.
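
To make the scaling argument concrete, here is a rough sketch using the textbook latency/bandwidth cost model for a ring all-reduce, which is one common way to synchronize gradients. It is not necessarily what any particular training run uses. The function name and every number in the usage lines are hypothetical placeholders, not measurements of real CPUs or A100s.

```python
def ring_allreduce_seconds(num_devices, payload_bytes, link_bytes_per_s, hop_latency_s):
    """Estimate the time of one ring all-reduce with a simple latency/bandwidth model.

    Assumption: a ring all-reduce takes 2 * (N - 1) steps (reduce-scatter
    followed by all-gather), and each step sends one chunk of size
    payload / N over a single link.
    """
    if num_devices < 2:
        return 0.0  # a single device has nothing to synchronize
    steps = 2 * (num_devices - 1)
    chunk_bytes = payload_bytes / num_devices
    return steps * (hop_latency_s + chunk_bytes / link_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical inputs: 1 GB of gradients, identical links for both cases.
    payload = 1e9          # bytes to all-reduce (assumed)
    bandwidth = 1e9        # bytes per second per link (assumed)
    latency = 1e-5         # seconds per hop (assumed)
    few = ring_allreduce_seconds(1_000, payload, bandwidth, latency)
    many = ring_allreduce_seconds(1_000_000, payload, bandwidth, latency)
    print(f"1,000 devices:     {few:.3f} s per all-reduce")
    print(f"1,000,000 devices: {many:.3f} s per all-reduce")
```

In this model the bandwidth term stays roughly constant as devices are added, but the latency term grows linearly with the device count, so a million small devices pay far more per synchronization than a thousand large ones even before accounting for slower links.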

1 comment

Thanks for providing this insight. Is the A100 the only such platform? Can we pause/resume all such platforms simultaneously?