

by MrSaints 2415 days ago

We are currently experimenting with workflow engines for orchestrating different components, e.g. data ingestion, data preprocessing, feature engineering, scoring, automated decision making / escalation. Specifically, we use Argo for offline and bulk processing, and Zeebe and Cadence (we are trying both out) for online processing and business logic / application services.

We don't yet have a polyglot architecture, but we do have the requirement of running distributed services (partly because certain components of the pipeline need to be run on-premise), and we have found that workflow engines / orchestrators definitely make it a lot easier to reason about the wider architecture and to get a bird's eye view. It works for us. There is no need to handle callbacks, events, queues, etc. ourselves. It also leaves the door open to a polyglot architecture later.
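To make the "bird's eye view" point concrete, here is a rough sketch in plain Python. It is not Argo, Zeebe, or Cadence code; the step functions, the sample records, and the `run` helper are all made up for illustration. The idea is only that the whole flow is declared in one place as an ordered list of steps, and something else (here a trivial loop, in practice the engine) takes care of moving data between them.

```python
# Hypothetical pipeline steps named after the components mentioned above.
# Each step takes the previous step's output and returns its own.

def ingest(_):
    # Stand-in for data ingestion; returns made-up records.
    return [{"amount": 120.0}, {"amount": 15.0}]


def preprocess(records):
    # Drop obviously invalid rows.
    return [r for r in records if r["amount"] > 0]


def engineer_features(records):
    return [{**r, "is_large": r["amount"] > 100} for r in records]


def score(records):
    # Toy scoring rule, not a real model.
    return [{**r, "score": 0.9 if r["is_large"] else 0.1} for r in records]


def decide(records):
    return ["escalate" if r["score"] > 0.5 else "approve" for r in records]


# The whole flow is visible in one place.
PIPELINE = [ingest, preprocess, engineer_features, score, decide]


def run(pipeline, data=None):
    """Run steps in order and record which ones executed."""
    history = []
    for step in pipeline:
        data = step(data)
        history.append(step.__name__)
    return data, history
```

A real engine adds what this loop lacks (retries, persistence, running steps on different machines), but the shape of the definition is what makes the architecture easy to reason about.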

We tried out Celery Workflows, and struggled to get them "production ready", so I'd advise against them for complex workflows. We also found the visibility lacking.

We have yet to fully try out Kubeflow and MLflow. What is not quite working at the moment is creating and deploying portable models, and I don't mean simply pickling and storing an artifact.

Leveraging containers (Docker), and slapping simple anti-corruption layers (e.g. simple web APIs) on top of existing code, have also helped. We have a more consistent way of deploying and isolating code without having to rewrite much.
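As a rough illustration of what such an anti-corruption layer could look like, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for the example: `legacy_score` stands in for existing model code with an awkward return shape, and the JSON contract (`features` in, `score` and `model` out) and the `serve` helper are invented. The point is that callers only ever see the small JSON contract, never the legacy code's internals.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def legacy_score(feature_row):
    """Hypothetical stand-in for existing model code we do not want to rewrite."""
    # Returns a positional tuple: (probability, internal model tag).
    return (min(1.0, sum(feature_row) / 10.0), "v1")


def score_request(payload):
    """Translate the public JSON contract to and from the legacy call."""
    features = [float(x) for x in payload["features"]]
    probability, model_tag = legacy_score(features)
    return {"score": probability, "model": model_tag}


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = score_request(json.loads(self.rfile.read(length)))
            status = 200
        except (KeyError, TypeError, ValueError) as exc:
            # Bad JSON or a missing/invalid "features" field.
            result = {"error": str(exc)}
            status = 400
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the wrapper; inside a container this would be the entrypoint."""
    HTTPServer(("0.0.0.0", port), ScoreHandler).serve_forever()
```

Calling `serve()` starts the wrapper on port 8080 (an arbitrary choice here). The model code itself stays untouched, which is the "without having to rewrite much" part.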

We want to look into using Nuclio and/or Knative to ease the process of deployment, and to empower the data scientists to deliver without much engineering expertise.

Others have mentioned using base classes or standard interfaces for their models. We tried this too, but it didn't work. The early generalisation ran into conflicting requirements, and broke the interface segregation principle (not that it matters too much, but it can be confusing not to know precisely what is being used and what isn't). We figured it's much easier to put off any abstractions. Let the data, and its flow, do the talking.
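For anyone unsure what the interface segregation problem looks like in practice, here is a small made-up example. None of these classes come from our codebase; `BaseModel`, `RuleBasedModel`, `Scorer`, and `run_scoring` are hypothetical names. A shared base class ends up advertising methods that some models cannot honour, whereas a narrow interface only states what a given caller actually needs.

```python
from typing import Protocol, Sequence, runtime_checkable


class BaseModel:
    """Hypothetical 'one interface for everything' base class."""

    def fit(self, rows, labels):
        raise NotImplementedError

    def score(self, row):
        raise NotImplementedError


class RuleBasedModel(BaseModel):
    """Hand-written rules: it scores, but there is nothing to train."""

    def score(self, row):
        return 1.0 if sum(row) > 5 else 0.0

    # fit() is inherited but unusable, so callers cannot tell from the
    # type alone which methods are actually safe to call.


@runtime_checkable
class Scorer(Protocol):
    """Narrow interface: only what the scoring step needs."""

    def score(self, row: Sequence[float]) -> float: ...


def run_scoring(model: Scorer, rows):
    # Depends on score() only, not on the whole BaseModel surface.
    return [model.score(r) for r in rows]
```

Even the narrow version is an abstraction you have to guess at up front, which is why waiting until the data flow makes the real boundaries obvious was easier for us.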