Hacker News new | ask | show | jobs
by donnygreenberg 1092 days ago
It's a great point. The funny thing is that the rest of the world just has dev, QA, and prod staging and canaries, while ML has "the 6 months it takes to translate from notebook to pipeline" or "uploading a new checkpoint/image to the platform." We can just stage properly and release through CD like everyone else does, but not spend 6 months flipping the switch. We built the ability to specify the package for a function as a git repo and a particular revision to enable this, and hopefully it means more people rely on version control as the source of truth in prod and not the most recently uploaded model checkpoint to Sagemaker. During experimentation though, it really is frustrating that many systems only allow you run what you've committed.

We've also built a basic permissioning system to control who can actually overwrite the saved version of a resource, so there are no accidents. E.g. if the prod inference blob is saved at "mikes_pizza/nlp/bert/bert_prod", you can set it so only x accounts can overwrite that metadata to point to a new model. Ideally we just inherit existing RBAC groups sometime soon.

Does that make sense? Curious if you had something else in mind as far as the danger.

1 comments

Ah, I see. The ability to push to infra is more about the development loop than prod rollouts. Prod can (and should) use CD with a well understood source of truth.

Thanks, I was misunderstanding the purpose of the feature.