Hacker News new | ask | show | jobs
Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (micahlerner.com)
3 points by mlerner 871 days ago