| HN Mirror



	by Majromax 132 days ago
	> GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.

	The scale there is a little bit different. If you're training an LLM with 10,000 tightly-coupled GPUs where one failure could kill the entire job, then your mean time to failure drops by that factor of 10,000. What is a trivial risk in a single-GPU home setup would become a daily occurrence at that scale.
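
	To put rough numbers on that scaling, here is a minimal sketch. It assumes GPU failures are independent with a constant failure rate (an exponential model), and that any single failure kills the job. The per-GPU figure of 240,000 hours (about 27 years) is a hypothetical value picked for illustration, not a measured one, and the function names are made up for this example.

```python
import math

# Hypothetical per-GPU mean time to failure, chosen only for illustration.
PER_GPU_MTTF_HOURS = 240_000.0  # roughly 27 years


def cluster_mttf_hours(per_gpu_mttf_hours: float, n_gpus: int) -> float:
    """Mean time until the first failure among n_gpus independent GPUs.

    With constant, independent failure rates the rates add up, so the
    mean time to the first failure is the per-GPU MTTF divided by n_gpus.
    """
    return per_gpu_mttf_hours / n_gpus


def prob_failure_within(hours: float, per_gpu_mttf_hours: float, n_gpus: int) -> float:
    """Probability that at least one GPU fails within the given window."""
    combined_rate = n_gpus / per_gpu_mttf_hours  # failures per hour
    return 1.0 - math.exp(-combined_rate * hours)


if __name__ == "__main__":
    for n in (1, 10_000):
        mttf = cluster_mttf_hours(PER_GPU_MTTF_HOURS, n)
        p_day = prob_failure_within(24.0, PER_GPU_MTTF_HOURS, n)
        print(f"{n:>6} GPUs: MTTF = {mttf:,.0f} h, P(failure within 24 h) = {p_day:.4f}")
```

	Under these assumptions a single GPU has about a 0.01% chance of failing on any given day, while the 10,000-GPU job has a mean time to failure of 24 hours. Correlated failures (shared power, cooling, or network) would change the picture, so treat this as a back-of-the-envelope estimate only.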