by Gatesyp
1721 days ago
I read through the homepage, but I'm still not entirely sure: why not just train on Spot Instances with a retry implemented? I see that SpotML has a configurable fallback to On-Demand instances, and perhaps its value prop is that it saves the state of your run up to the interruption and resumes it on the On-Demand instance. But why not just set a retry on the Spot Instance if it's interrupted? I'm failing to see what is different between SpotML and using Metaflow's @retry decorator with AWS Batch: https://docs.metaflow.org/metaflow/failures#retrying-tasks-w... If you're still in the comments, Vishnu, I would love to hear your thoughts.
I've read through the docs, and the one difference that comes to mind is the automatic fallback to On-Demand and the switch back to Spot when it becomes available. I can't readily see a way to do this in Metaflow yet, but it's possible I've missed something.
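
To make the distinction concrete, here is a rough sketch of "retry on Spot, then fall back to On-Demand and resume from the saved state." It is a toy simulation in plain Python, not SpotML's or Metaflow's actual API: `SpotInterrupted`, `Checkpoint`, `train`, and `run_with_fallback` are all hypothetical names, and the interruptions are simulated rather than read from AWS.

```python
from dataclasses import dataclass


class SpotInterrupted(Exception):
    """Hypothetical signal that the Spot instance was reclaimed."""


@dataclass
class Checkpoint:
    """Stand-in for saved training state; a real run would persist this."""
    step: int = 0


def train(total_steps, ckpt, interrupt_at=None):
    """Toy training loop that resumes from ckpt.step.

    If interrupt_at is given, simulate a Spot reclaim at that step.
    """
    while ckpt.step < total_steps:
        if interrupt_at is not None and ckpt.step == interrupt_at:
            raise SpotInterrupted(f"reclaimed at step {ckpt.step}")
        ckpt.step += 1


def run_with_fallback(total_steps, spot_retries, interruptions):
    """Try Spot up to spot_retries times, then finish on On-Demand.

    interruptions lists the steps at which successive Spot attempts are
    interrupted (simulated). Returns the final checkpoint and the sequence
    of instance types used.
    """
    ckpt = Checkpoint()
    used = []
    pending = list(interruptions)
    for _ in range(spot_retries):
        used.append("spot")
        interrupt_at = pending.pop(0) if pending else None
        try:
            train(total_steps, ckpt, interrupt_at)
            return ckpt, used
        except SpotInterrupted:
            continue  # progress is kept in ckpt; retry on Spot
    # Spot retries exhausted: resume from the checkpoint on On-Demand.
    used.append("on-demand")
    train(total_steps, ckpt)
    return ckpt, used


ckpt, used = run_with_fallback(total_steps=10, spot_retries=2, interruptions=[3, 6])
print(ckpt.step, used)
```

A plain retry covers only the loop over Spot attempts. The fallback branch is the extra piece discussed above, and this sketch does not model switching back to Spot once capacity returns.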