by essekar 68 days ago
anything tbh. as long as you have a runbook, you can try to automate the actions through nvsx; it sits on top of NVSentinel. restarting will mostly work for smaller jobs. distributed training, which is pretty common, will need more fault-tolerant methods to keep going rather than just restarting.