by FridgeSeal 2418 days ago
Currently:

Models and feature engineering done in Python, trained locally, weights uploaded to S3. A Dockerfile with a tiny little web server gets deployed through our CI/CD pipeline for serving.
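For readers who want a picture of the "tiny little web server" part, here is a minimal sketch using only the Python standard library. It is not the poster's actual code: `MeanModel`, `load_model`, `handle_predict`, the pickle format, the `{"rows": ...}` request shape, and port 8080 are all assumptions made for illustration, and the S3 download step is left out.

```python
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer


class MeanModel:
    """Hypothetical stand-in for a trained estimator; anything with predict() works."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def load_model(path):
    """Load pickled weights from a local path (e.g. after fetching them from S3)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def handle_predict(model, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    rows = json.loads(body)["rows"]
    # A real estimator may return array types that need converting
    # to plain Python values before JSON encoding.
    predictions = list(model.predict(rows))
    return json.dumps({"predictions": predictions}).encode()


def make_handler(model):
    """Build a request handler class bound to one loaded model."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_predict(model, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)

    return Handler


def serve(model, port=8080):
    """Block forever, answering POST requests with predictions."""
    HTTPServer(("0.0.0.0", port), make_handler(model)).serve_forever()
```

Inside the container, the entry point would then be something like `serve(load_model("model.pkl"))`, with the path being whatever the image copies or downloads the weights to.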

Soon: Argo Workflows + Polyaxon for data collection, feature engineering, training etc. Push the best model to S3, then the same CI/CD process with a Docker container deploys the little web server onto our Kubernetes environment.

Deep learning stuff will probably use a similar setup, but with PyTorch instead of Sklearn. Would like to look at serving with ONNX exporting.

When the Julia packages evolve a little more, will be looking forward to using that in production.

2 comments

Glad to see that you are interested in using Polyaxon[0] for your MLOps. Although I was going to write a blog post about the upcoming v1.0 release of Polyaxon, I just wanted to point out that there will be native support for different types of workflows. Currently it supports parallelism and distributed learning, and in the next release there will be native support for DAGs as well. Here's a test fixture[1] of what a DAG workflow will look like in Polyaxon.

Happy to answer any questions or provide more information.

[0]: https://github.com/polyaxon/polyaxon

[1]: https://github.com/polyaxon/polyaxon/blob/master/cli/tests/f...

This looks incredibly exciting. What's the story around what happens after training/validation? I can't see anything on your site specific to this - do you currently (or plan to) offer anything to help track or version model "releases"?
Since it's possible to create custom components, one can, for instance, create a component for packaging models. The component can extract the model from a path (mounted volume or blob storage) and create a Python package with a requirements file and a Flask-based microservice. Every time a user wants to promote a model to production, she can run this component, or add it to a workflow to be triggered after a training or hyperparameter-tuning operation.
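As a rough illustration of what the body of such a packaging step could do, here is a sketch in plain Python. It is not Polyaxon's component API: `package_model`, `APP_TEMPLATE`, the `model.pkl` file name, and the `/predict` route are hypothetical names chosen for the example, and it assumes a pickled model with numeric predictions.

```python
import shutil
from pathlib import Path

# Source of the generated Flask microservice (hypothetical layout).
APP_TEMPLATE = '''\
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as fh:
    model = pickle.load(fh)


@app.route("/predict", methods=["POST"])
def predict():
    rows = request.get_json()["rows"]
    # Assumes numeric predictions that can be cast to float.
    return jsonify(predictions=[float(p) for p in model.predict(rows)])
'''


def package_model(model_path, out_dir, requirements=("flask",)):
    """Copy a trained model next to a requirements file and a Flask app."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # The model path could be a mounted volume or a local copy from blob storage.
    shutil.copy(model_path, out / "model.pkl")
    (out / "requirements.txt").write_text("\n".join(requirements) + "\n")
    (out / "app.py").write_text(APP_TEMPLATE)
    return sorted(p.name for p in out.iterdir())
```

Running this as the last step of a workflow would leave a directory that can be built into an image or handed to whatever deployment target is in use.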

From Polyaxon's side, we will be providing a set of reusable components. Some of these will target deployment and packaging, for instance AWS Lambda, Azure ML, SageMaker, and open source projects for deployment (some of them are mentioned in this HN thread). Users can also contribute components or create them for use inside their organization.

For versioning, all components have versions, and all runs have a full overview of their dependencies and provenance (inputs/outputs). This gives all the information about when and how a model was created.

The platform knows how to create and manage services, e.g. TensorBoards, notebooks, Dash, simulators for RL agents, and so on, so one can also deploy the model as an internal Polyaxon service to be used as an internal tool or for testing purposes. The only thing to keep in mind is that the API endpoint will be subject to the same access rights as other components created by a given user.

Any inputs on Argo workflows vs Kubeflow vs MLFlow? Which is better suited?
I will try to be as objective as possible answering this question, since I am working on a project in a competing space.

Argo Workflows is a pipeline engine that is cloud and Kubernetes native. It tries to solve graph and multi-step workflows using containers on Kubernetes, and it can be leveraged for ML pipelines as well as other use cases.

Kubeflow is a large project that has several components: training operators, serving (based on Istio and Knative), metadata (used by TensorFlow TFX), pipelines, ... and integrates with other projects. Kubeflow Pipelines uses Argo Workflows as its workflow engine, although I think there are efforts to support other projects such as Tekton, which is also a Google project, and possibly TFX as a DSL for authoring pipelines in Python.

The main focus for MLFlow, I think, is tracking ML models and providing an intuitive interface to model deployment and governance. The main strength of MLFlow is that it's easy to install and use.

Polyaxon has been used mainly for fast development and experimentation. It has a tracking interface and several integrations for dashboarding, notebooks, and distributed learning. Polyaxon also has native support for some Kubeflow components, e.g. TFJob, PyTorchJob, and MPIJob for distributed learning.

The upcoming Polyaxon release will provide a larger set of integrations for dashboards. In addition to TensorBoards, notebooks, and JupyterLab, users will be able to start and share Zeppelin notebooks, Voila, Plotly Dash, Shiny, and any custom stateless service that can consume the outputs of another operation.

The new workflow interface focuses mainly on an easy, declarative way to handle DataOps and MLOps. The main idea is to provide a very simple interface for the user to go from a data transformation to training models. Since the component abstraction is based on containers, it can be used for other operations, e.g. packaging models and preparing them to be served on other open source projects, cloud providers, or lambda functions. Operators for frameworks such as Dask, Spark, and Flink could also be used as steps in a workflow.

For hyperparameter tuning, the platform currently has grid search, random search, Hyperband, and Bayesian optimization. One of the major changes in the next release is a new interface for people to create their own algorithms, plus a mapping interface to traverse a search space provided by the user or based on the output of another operation.
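To make the first two strategies concrete, here is a small generic sketch of grid search and random search over a user-provided search space. It does not use or imitate Polyaxon's interface: the function names, the dict-of-lists space format, and the toy objective are assumptions for illustration only, and Hyperband and Bayesian optimization are not shown.

```python
import itertools
import random


def grid_search(space):
    """Yield every combination of values in the search space."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_search(space, n_trials, seed=0):
    """Yield n_trials configurations sampled uniformly from the search space."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {k: rng.choice(v) for k, v in space.items()}


def best_config(configs, objective):
    """Return the configuration with the lowest objective value."""
    return min(configs, key=objective)


# Hypothetical search space and a toy objective standing in for a training run.
space = {"lr": [0.1, 0.01], "depth": [2, 4, 8]}


def toy_loss(config):
    return abs(config["lr"] - 0.01) + abs(config["depth"] - 4)
```

In practice `toy_loss` would be replaced by something that launches a training job and returns its validation metric, e.g. `best_config(grid_search(space), toy_loss)` picks the winner among the six grid points.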

Kubeflow, unless I’m missing some things, is for TensorFlow pipelines. If you’re not using TF, or it’s not the only thing you use, it’s not ideal.

I thought MLflow was a Spark thing, and we’re trying to migrate off of Spark/Databricks due to the resource inefficiencies of Spark (at our scale) and the maintenance nightmare that Python notebooks are causing us.

Argo is just a container workflow tool, not ML specific. We’re planning on using Argo for the data engineering parts, and polyaxon for the ML training parts because of the convenient monitoring and hyper parameter search tools.

Hi! Co-founder of Kubeflow here - definitely not TensorFlow only! You can see ([1]) many many different repos and operators. The nice part about Argo for us is it let us build an ML specific DSL that was also Kubernetes native.

[1] https://github.com/kubeflow/

What about A/B testing? What do you use for your A/B strategy? How many predictions are being served by the model per second?
For most of the stuff we’ve deployed, we’re not yet operating at a scale/level of interest where A/B testing is worth it. Additionally, the purposes we’re using most of these models for don’t really necessitate A/B testing.

When we do need A/B testing, we’ll probably use something like Seldon. As for predictions/second, not very much at the moment: 1 per 30 seconds maybe? It’s not deployed into a Kubernetes cluster because of scaling requirements; it’s because that’s where all our other services get deployed to, and it’s more beneficial (ops and cost wise) to also deploy into there than it is to bother with having a separate workflow for deploying to Lambdas or SageMaker.

So how do you know if a new version of a model is better than the existing serving version?
As currently the only person doing data science things for the team, I’ll test to make sure changes I make to model/feature engineering/etc result in a better model. We’re not constantly retraining our models, because our incoming data still behaves the same. We’ve had the same model in prod for 4 months now; we don’t have any pressing issues with its predictions, and looking through the logs of what the input was and the output, it’s still performing as expected, so we’ll probably leave it longer.
I see, so how do you measure the difference between the incoming data and your training data?

Also, it looks like you have a very low volume of predictions?