by essekar
68 days ago
Anything tbh. As long as you have a runbook, you can try to automate its actions through nvsx; it sits on top of NVSentinel.
Restarting will mostly work for smaller jobs. Distributed training, which is pretty common, needs more fault-tolerant methods so the job can continue rather than just restart.
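
To make the "continue rather than just restart" point concrete, here is a rough sketch of checkpoint-and-resume in plain Python. It is not nvsx or NVSentinel code; `save_checkpoint`, `load_checkpoint`, `train`, and the `fail_at` knob are all made-up names for illustration, and the "training step" is just a running sum.

```python
import json
import os


def save_checkpoint(path, state):
    # Write to a temp file first, then rename, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Fall back to a fresh state when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path, num_steps, ckpt_every=5, fail_at=None):
    # Hypothetical loop: resume from the last checkpoint instead of step 0.
    state = load_checkpoint(path)
    start = state["step"]
    for step in range(start, num_steps):
        if fail_at is not None and step == fail_at:
            # Stand-in for a node or GPU failure (illustrative only).
            raise RuntimeError(f"simulated node failure at step {step}")
        state["total"] += step  # stand-in for one training step
        state["step"] = step + 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(path, state)
    return start, state
```

A plain restart would redo every step from 0; here a second call to `train` with the same path picks up at the last saved step, so only the work since that checkpoint is lost. Real distributed jobs also have to deal with re-forming the worker group and sharded state, which this sketch ignores.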