by belter 1262 days ago
Or $9/hour if you use Spot :-)

https://aws.amazon.com/ec2/spot/pricing/


Hopefully your progress gets saved before the Spot instance inevitably gets terminated in the middle of training.
"Managed Spot Training..."

"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."

https://docs.aws.amazon.com/sagemaker/latest/dg/model-manage...
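For what it's worth, the resume-from-checkpoint part that the quote describes has to live in your own training script; SageMaker only syncs the directory to and from S3. Here is a rough sketch of that pattern in plain Python, not SageMaker's API. The names `train`, `save_checkpoint`, `latest_checkpoint` and the `stop_at` parameter are made up for illustration, and the "training" is a stand-in that just decays a number. In a real job you would point `ckpt_dir` at the configured local checkpoint path (I believe the default is `/opt/ml/checkpoints`, but treat that as an assumption and check your job config).

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return (step, path) of the newest checkpoint, or (0, None) if none exist."""
    steps = []
    for name in os.listdir(ckpt_dir):
        if name.startswith("step-") and name.endswith(".json"):
            steps.append(int(name[len("step-"):-len(".json")]))
    if not steps:
        return 0, None
    step = max(steps)
    return step, os.path.join(ckpt_dir, f"step-{step}.json")


def save_checkpoint(ckpt_dir, step, state):
    """Write to a temp file, then rename, so a kill mid-write leaves no partial checkpoint."""
    final = os.path.join(ckpt_dir, f"step-{step}.json")
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, final)


def train(ckpt_dir, total_steps, save_every=10, stop_at=None):
    """Toy training loop that resumes from the newest checkpoint in ckpt_dir.

    stop_at is a hypothetical knob that simulates a Spot interruption.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    step, path = latest_checkpoint(ckpt_dir)
    state = {"loss": 1.0}
    if path is not None:
        with open(path) as f:
            state = json.load(f)["state"]
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            return step  # simulated interruption: work since the last save is lost
        step += 1
        state["loss"] *= 0.99  # placeholder for one real training step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step
```

Calling `train(d, 100, stop_at=25)` and then `train(d, 100)` on the same directory picks up again from step 20, the last saved checkpoint, so you lose at most `save_every` steps of work per interruption.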

If you use Horovod Elastic, I think you can avoid this problem when training across a cluster of Spot instances.

https://horovod.readthedocs.io/en/stable/elastic_include.htm...