Hacker News new | ask | show | jobs
by essekar 95 days ago
One of the feedbacks we got while testing was - we might even reduce the duration of checkpointing. Which was a huge insight.

We are testing out & would love more collaboration from the community - if your team is running training jobs, hit us up.

1 comments

I agree. If you can reliably recover from failures by migrating instead of restoring from checkpoints, then the need to checkpoint frequently becomes far less important and reduces the overhead of taking checkpoints. Similarly, a lot of users are experimenting with asynchronous checkpoints to reduce the blocking penalty, but there are big tradeoffs there with DRAM usage. So being able to take checkpoints far less frequently gives you more flexibility on the type of checkpointing you use as well.