by kahnjw 2479 days ago
Why don't you just checkpoint the model every n steps? NNs fail for a myriad of reasons; you can easily reduce the risk by routinely saving state.

After my last OOM party, I now have it checkpointing every 1000 steps (way too often, I think, but there's plenty of disk), but I just really, really want it to complete a full run. ;)
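
For anyone who wants the "checkpoint every n steps" idea spelled out, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `train`, `save_checkpoint`, and `load_checkpoint` names, the pickle format, the dummy "weights" update, and the `fail_at` knob that fakes an OOM. A real run would swap in the framework's own save/load calls. The one detail worth keeping is writing to a temp file and renaming, so a crash in the middle of a save can't clobber the last good checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path via a temp file plus rename, so a crash mid-write
    leaves the previous checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_steps, checkpoint_every, path, fail_at=None):
    """Hypothetical training loop: resumes from path if a checkpoint exists
    and saves state every checkpoint_every steps."""
    state = load_checkpoint(path) or {"step": 0, "weights": 0.0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise MemoryError("simulated OOM")  # stand-in for a real crash
        state["weights"] += 0.1  # placeholder for a real optimizer step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # always keep the final state
    return state
```

With `checkpoint_every=1000`, a crash loses at most the steps since the last multiple of 1000, and calling `train` again with the same path picks up from there instead of from step 0. How often to save is a trade-off between disk and I/O time on one side and lost work on the other.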