| HN Mirror

Users expose their model to our Trial API (https://docs.determined.ai/latest/topic-guides/model-definit...), the base class then implements a training loop (which can be enhanced with user-supplied callbacks, metrics, etc.) that has a whole bunch of bells and whistles. Easy distributed (multi-GPU and multi-node) training, automatic checkpointing, fault tolerance, etc.

Concretely, the system is regularly taking checkpoints (which include model weights and optimizer state) and so if the spots disappear (as they do), the system has enough information to resume from where things were last checkpointed when resources become available again.