by Longwelwind 1178 days ago
I've been an MLOps Engineer for around 3 years now and I mostly agree with the article. There is a big overlap between the ML-specific tools that are popping up on the market and traditional Data Engineering tools, and I think people do not always realize that:

* Prometheus/Grafana/TSDB/... can be used to set up a model monitoring platform, since you're observing metrics whether they come from an ML service or a normal service.

* Any service deployment tool can be used to deploy ML models, since they are services.

* Airflow/Dagster/... can be used to orchestrate model training, since training a model is basically a data engineering task.

With that said, I still believe that there is space for ML-specific tools to be created.

* Model Monitoring tools (ArizeAI is the only one I've used) can be tailored to be easily usable by ML Engineers without requiring DE knowledge.

* Deploying models in production has some specificities: things like GPU support, adaptive batching, ... Those specificities can be implemented inside a model deployment tool.

* Training orchestration is the only domain where I think there's truly no need for new tools.


I've used dozens of platforms, distributed job queues and pipelining tools, including Airflow, Pachyderm and a bunch of others. Most of them turned out to be more effort than they were worth and were designed around a very specific use case. Some of them looked fantastic but then had all sorts of weird cases to account for. Kinda like how ArgoCD looks great, but has a bunch of common bugs that nobody seems to care enough about to fix.

In the end, the most successful platform I built was a custom ORM around Redis objects and queues. The most important part wasn't the fancy data processing platform but the details: the container layers, and refactoring the code to make it composable, releasable and easy for the scientists to play with, with enough guard rails that they wouldn't diverge too far from the structure.

It made us incredibly fast at iterating. Of all the things I worked with, Airflow was the one I was most hyped about from all the videos I had seen, and it turned out to be the biggest mess of them all.

External blackboards and functional code that operates over them continue to pay dividends. I too have had an amazing amount of success with this pattern. The Redis API is just so damn nice.
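A rough sketch of that blackboard pattern, under loud assumptions: a plain dict stands in for the Redis keyspace so the example is self-contained, and the step names (`tokenize`, `count_tokens`) are hypothetical. Each step is a pure function that reads keys from the board and returns a new board with its results added.

```python
from typing import Any, Callable, Dict, Iterable

# Assumption: a dict stands in for an external store such as Redis.
Blackboard = Dict[str, Any]
Step = Callable[[Blackboard], Blackboard]


def tokenize(board: Blackboard) -> Blackboard:
    # Reads "text", writes "tokens"; the input board is not mutated.
    return {**board, "tokens": board["text"].lower().split()}


def count_tokens(board: Blackboard) -> Blackboard:
    # Reads "tokens", writes "n_tokens".
    return {**board, "n_tokens": len(board["tokens"])}


def run_pipeline(board: Blackboard, steps: Iterable[Step]) -> Blackboard:
    # Steps only communicate through the board, so they compose freely.
    for step in steps:
        board = step(board)
    return board


start = {"text": "Hello Blackboard World"}
result = run_pipeline(start, [tokenize, count_tokens])
```

Because steps share nothing but the board, reordering or swapping one out does not require touching the others.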
Yet, everyone misses reproducibility and data versioning :)

So, talking about monitoring, training, and recording model drift is only a single side of the domain.

> Yet, everyone misses reproducibility and data versioning :)

Delta Lake/Apache Iceberg solves that.

Absolutely not.

A single vendor/tech does not "solve" anything when the task at hand implies you need to entirely re-design data pipelines, ML modelling and benchmarking.

LMAO no, no it doesn't, and it has major migration consequences for existing data warehouses.

Reproducibility is more than just upstream data versioning.

Not to mention dependency management! Since a lot of ML code is in Python, this ends up being a very tricky thing to handle at scale (especially if you need to update dependencies, etc.)

I didn’t care much for Docker until I started working in the Python ecosystem.

May I recommend looking into flyte.org? It is an open-source, Kubernetes-native "orchestration"-style tool, but essentially an infrastructure component geared toward making your ML engineers and data scientists more productive. I think iteration velocity, dynamic infrastructure management and trackability are really important and fundamentally different needs of such products.

PS. I am a maintainer at flyte.org (thoughts are my own)

Training orchestration does need new tools to spin up GPU instances, make the most of them and then spin them down. We are still struggling in this domain.

I don't know what tools you are using, but this can be achieved with Airflow on k8s, for example:

* Add a GPU resource requirement on one of your steps.

* Add an auto-scaler that adds GPU nodes to your cluster based on the GPU resource demand.

After having written the above, I realize that it might sound like that famous HN comment about how you can /easily/ re-create Dropbox yourself, which might actually prove your point that there is a need for ML-specific tools for the training part.
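For the first of the two steps listed in that comment, a minimal sketch of what the GPU resource requirement might look like, written as a plain Python dict mirroring a Kubernetes Pod manifest. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs under the `nvidia.com/gpu` resource name; the pod name, image and command are hypothetical. The autoscaler from the second step is configured separately and is not shown.

```python
def gpu_pod_manifest(name, image, command, gpus=1):
    """Build a Pod manifest (as a dict) that requests `gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",  # a training step runs once
            "containers": [
                {
                    "name": "train",
                    "image": image,
                    "command": command,
                    # GPUs are requested through resource limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical values for illustration only.
manifest = gpu_pod_manifest(
    "train-step", "example.com/trainer:latest", ["python", "train.py"], gpus=2
)
```

A pending pod with this limit is what a node autoscaler reacts to when it decides to add a GPU node.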

Having to set up and run Airflow on K8s is a hell of a prerequisite step to getting cost-efficient and fast access to GPU training.

Airflow is also absolutely not built for that purpose. It's ~10yr old Hadoop-era technology.

As for getting Airflow on k8s in the first place, the Apache Airflow Helm chart pretty much just handles things, doesn't it? It might be a pain to manage many deployments for many teams, but going from 0 to 1 isn't so bad.

As for configuring the KubernetesPodOperator to ask for pods with GPUs, it exposes the k8s Python API in the DAG definition. I haven't done it myself, but I think that it's not really Airflow that's going to be a pain there. Getting the pod spec right is gonna have to happen whatever does the orchestration.

(Full disclosure: my employer offers Airflow as a service)

I agree with you that there is still room for improvement when it comes to the efficiency and effectiveness of training orchestration tools. It's true that setting up and spinning down GPU instances can be challenging, and optimizing the use of these resources is essential given their cost.

Yup, you're just waiting 10 minutes to add a GPU node, nothing to see here.

Just use https://modal.com/ :)

At Canva I built auto-scaling GPU infra on K8s for model training[1], and it's way too much work and operational expense to be worth building yourself. I went to work at Modal because building it properly once and then distributing the solution was going to be just way better and more efficient.

1. https://canvatechblog.com/supporting-gpu-accelerated-machine...

I'm not sure why you are getting downvoted, probably because people feel you are advertising Modal.

But I have to say something about Modal. The difference with this vendor is that they try to reimagine the way people build on the Cloud and it's worth checking out just to see how different the developer experience could be.

I know that most people use it because of the easy and affordable access to GPUs, but I think we are missing the true innovation here, which is the developer experience.

I would even consider Modal as a cloud infra product, although a vertical one, more than an ML or DE product.

*edited to fix some spelling*

Didn't realize it was downvoted, but fair enough if people feel it's too much of an ad. Comment is sitting at 2 points now :)

Glad you really get what we're trying to do with Modal. You're right it's not just an easy way to get serverless GPUs.

Modal is reimagining software development practices for the cloud era. Developing in the cloud should not be just writing YAML or HashiCorp Configuration Language templates, pushing/pulling Docker images, and re-running 'infratool up' over and over until things work.

I talk to people who want to set up infra to use cloud GPUs and many of them say "I want to use Modal".

Common reasons not to use it include (1) "I have soooo many AWS credits that I want to use" and (2) (our company's reason) "We have on-prem GPUs but sometimes need Cloud GPUs as well with the same interface".

Using e.g. Ray with AWS is very painful; it took us a long time to iron out all the quirks.

Yep, AWS/GCP/Azure credits are a common reason. It's been discussed within the team, and we should work something out for that.

Why isn't that solved by k8s + a node autoscaler such as Karpenter?

That's a viable solution, but since GPU instances are expensive, you really want to make the most of them. Ideally the GPU should be busy within 30 seconds of instance launch.

Okay, so, where is your training data? Is your training data in a layout that your training code can just linearly scan on S3? Or do you have to transform it first? Or provision a dataset cache on demand? Is this data engineering or training orchestration?

> Okay, so, where is your training data? Is your training data in a layout that your training code can just linearly scan on S3? Or do you have to transform it first? Or provision a dataset cache on demand? Is this data engineering or training orchestration?

Not claiming this is the only or best solution, but the way my team solved that was by creating an internal Python lib with common happy paths to access our infrastructure and processes. We deploy our data pipelines as FastAPI services and call them using Airflow. This architecture has scaled really well: we have 300+ data pipelines, even more schedules and 3 engineers. We use Knative so our AWS bill is quite cheap for the number of services we are running.

It all boiled down to treating ML / data engineering problems as common software problems.

Thanks for sharing your experience.

Yeah, that's my point, it's hardly a solved problem and you have to write software for this!

Operationally very simple: ELT -> GUID-based naming convention on S3 or Lustre on FSx (name and keep if preserving data, not replication steps) -> Point GPU instance to data (e.g. SageMaker can transfer data stored on S3 with different approaches and costs, YMMV). Poll training job. Spin down GPU when complete.
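The last two steps of that flow (poll the training job, spin down the GPU when complete) can be sketched as below. This is only an illustration: `launch`, `get_status` and `terminate` are hypothetical callables standing in for whatever cloud API is in use, and the terminal status names are assumptions rather than any vendor's actual values.

```python
import time

# Assumption: these status strings are illustrative, not a real API's values.
TERMINAL_STATUSES = ("Completed", "Failed", "Stopped")


def run_training(launch, get_status, terminate, poll_seconds=30.0, sleep=time.sleep):
    """Launch a job, poll until it reaches a terminal status, then clean up."""
    job_id = launch()
    try:
        while True:
            status = get_status(job_id)
            if status in TERMINAL_STATUSES:
                return status
            sleep(poll_seconds)
    finally:
        # Runs on success, failure or an exception while polling,
        # so the GPU instance is never left running by accident.
        terminate(job_id)
```

Putting the teardown in `finally` is the design choice that matters here: an error in the polling loop still releases the expensive resource.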

ELT = data engineering. Model architecture & training design = MLE. MLOps is the storage of the training data, monitoring of the whole process, caching of model for use in serving and deployment, and retiring of resources. MLOps has some overlap with dataops, e.g. caching of training data, serving of model as application, but monitors for different things like data/concept drift.

You don't want idle containers on GPUs. Something like KServe, which sits on Knative (similar to AWS Lambda), is pretty useful and allows scaling deployments to 0. There is some request buffering in front of the containers, and scaling is based on the number of concurrent requests a container can support. Since almost all of these deployed model inference services are GPU- and CPU-bound, you don't want to route more requests than a container can handle, because CPU/GPU contention harms throughput.
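The concurrency-based scaling described above can be sketched as a simple calculation. This is a deliberately simplified illustration, not the actual algorithm of any autoscaler: real ones typically smooth the request count over a time window, and the function and parameter names here are hypothetical.

```python
import math


def desired_replicas(in_flight_requests, container_concurrency, max_replicas):
    """Return how many containers are needed for the current load.

    Each container handles at most `container_concurrency` requests at
    once, so extra requests wait in a buffer instead of contending for
    the same GPU.
    """
    if container_concurrency <= 0:
        raise ValueError("container_concurrency must be positive")
    if in_flight_requests <= 0:
        return 0  # scale to zero: no idle containers holding GPUs
    needed = math.ceil(in_flight_requests / container_concurrency)
    return min(max_replicas, needed)
```

For example, 9 in-flight requests with a per-container concurrency of 4 call for 3 containers, and no traffic calls for none.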
You may want to look at run.house [0] for a pretty powerful solution to many of these problems.

[0] https://github.com/run-house/runhouse