Hacker News new | ask | show | jobs
by flybarrel 841 days ago
Newbie question - What happens after when an LLM training job experience a hardware failure? I don't suppose you lose all the training progress do you? Then the pain is mostly in the diagnostic of the problem and getting the cluster running again, but no need to worry about data loss right?