by marmaduke 2783 days ago
This article (https://towardsdatascience.com/uber-introduces-pyml-their-se...) does a better job of motivating PyML, or maybe I'm just more awake now. In any case, I see what you mean. The GitLab CI setup we have builds Docker images out of our models, and we use branch names to target datasets, so "production" usage is "just" creating a branch, watching it run, checking results, etc.

Maybe a missing detail is that our models are run-once: once results are QA'd, they are sent to the relevant practitioner, so Uber's query-per-second stuff is irrelevant for us (for now), which I can see simplifies the deployment question enormously.
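
The branch-to-dataset convention described above could look roughly like this. It's a minimal sketch, not the actual setup: the `dataset/` branch prefix, the fallback name `smoke-test`, and both function names are assumptions made for illustration. `CI_COMMIT_REF_NAME` is the variable GitLab CI sets to the branch or tag name of the running pipeline.

```python
import os

# Hypothetical convention: a branch named "dataset/<name>" targets dataset <name>.
# Any other branch falls back to a small default dataset.
PREFIX = "dataset/"
DEFAULT_DATASET = "smoke-test"


def dataset_for_branch(branch: str) -> str:
    """Map a Git branch name to the dataset it should run against."""
    if branch.startswith(PREFIX) and len(branch) > len(PREFIX):
        return branch[len(PREFIX):]
    return DEFAULT_DATASET


def current_dataset(env=os.environ) -> str:
    """Resolve the dataset for the pipeline that is currently running."""
    # GitLab CI exports the branch (or tag) name as CI_COMMIT_REF_NAME.
    return dataset_for_branch(env.get("CI_COMMIT_REF_NAME", ""))
```

With a scheme like this, pushing a branch called `dataset/cohort-a` would be enough to start a run against `cohort-a`, with no change to the code itself.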

2 comments

Hello, Community Advocate from GitLab here. I was reading through your comments and it's great to hear how you use GitLab for your setup. Thanks for sharing your story with the community and we'd love to hear more from you on how GitLab helps you.
> we'd love to hear more from you on how GitLab helps you.

do you have specific questions?

We wanted to hear which features you like using the most and how those features help you with setting up your project. However, you wrote https://news.ycombinator.com/item?id=18384804 which answers a lot of the questions. Thanks!
Interesting. In that case, why do you even use Docker? Does it make distributing models easier?

Would love to know more about your packaging setup - the branch name to divide datasets is a nice trick (I'll use it as well).

How does your CI know where to find models? I'm betting you are using some kind of convention here: one model per py file, so each py file gets packaged in a Docker container.

If it is possible, would love to see the skeleton structure of one of your pre-packaged files.

TL;DR: it seems you invented something like PyML as well. Are the deployment scripts and model skeletons open source?

Our GitLab instance has a lot of projects and it’s been helpful for the users to have a set of template projects, each with their own Docker image. Some of those images are many gigabytes in size, with tricky env vars, etc. Docker “democratized” CI for most of our scientific personnel who aren’t devs, since they can hit the Fork button and have a working CI config to base their project on.

In the ML projects, it serves mainly to package dependencies and to enforce some basic security constraints: raw datasets are accessible read-only, so if we suspect an issue with cached results (because our inner orchestrator is Make...) we can nuke all the results and start over from scratch, confident that the raw data is intact.
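
To make the "nuke the results, keep the raw data" idea concrete, here is a rough sketch under stated assumptions. The directory layout and the names `fingerprint` and `reset_results` are hypothetical, and the checksum comparison is an extra safety check added for illustration; in the setup described above, the read-only access to raw data is what provides the guarantee.

```python
import hashlib
import shutil
from pathlib import Path


def fingerprint(root: Path) -> str:
    """Hash the relative path and contents of every file under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def reset_results(raw_dir: Path, results_dir: Path) -> None:
    """Delete all cached results, then confirm the raw data is unchanged."""
    raw_dir, results_dir = raw_dir.resolve(), results_dir.resolve()
    # Refuse to run if one directory contains (or equals) the other.
    if (raw_dir == results_dir
            or raw_dir in results_dir.parents
            or results_dir in raw_dir.parents):
        raise ValueError("results_dir must not overlap raw_dir")
    before = fingerprint(raw_dir)
    shutil.rmtree(results_dir, ignore_errors=True)
    results_dir.mkdir(parents=True, exist_ok=True)
    if fingerprint(raw_dir) != before:
        raise RuntimeError("raw data changed during reset")
```

After a reset like this, the orchestrator sees no up-to-date targets and rebuilds every result from the raw inputs.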

The models and arguments are in the CI config. No magic there, but since it’s all in the repo I’m ok with it.
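
One way a CI job can carry the model and its arguments is to call a single entry point with flags, so the job definition in the repo is the full record of what ran. The sketch below is only an illustration of that pattern: the model names, the flags, and the placeholder `fit_*` functions are all assumptions, not the actual setup.

```python
import argparse


# Hypothetical model functions; real ones would fit a model and write results.
def fit_linear(dataset: str, iterations: int) -> str:
    return f"linear on {dataset} for {iterations} iterations"


def fit_nonlinear(dataset: str, iterations: int) -> str:
    return f"nonlinear on {dataset} for {iterations} iterations"


# Registry of the models a CI job is allowed to select by name.
MODELS = {"linear": fit_linear, "nonlinear": fit_nonlinear}


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run one model on one dataset.")
    parser.add_argument("--model", choices=sorted(MODELS), required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--iterations", type=int, default=100)
    return parser.parse_args(argv)


def main(argv=None) -> str:
    """Parse the arguments and dispatch to the selected model."""
    args = parse_args(argv)
    return MODELS[args.model](args.dataset, args.iterations)
```

A CI job would then invoke something like `main(["--model", "linear", "--dataset", "cohort-a"])`, with the arguments written out in the CI config rather than discovered by convention.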

This whole setup was put together for an upcoming clinical trial as steps toward ISO quality norms compliance, and I can’t share it now. I do intend to reproduce it in an open form alongside our existing software (GitHub.com/the-virtual-brain) when it’s ready.

In any case I appreciate your questions a lot: they drove me to think a little harder and see why stuff like Michelangelo and PyML is something that even we (an academic/clinical group) should be using... if we can find the time to do it.