| I've been an MLOps Engineer for around 3 years now and I mostly agree with the article. There is a big overlap between the ML-specific tools that are popping up on the market and traditionnal Data Engineering tools, and I think people are not always realizing that: * Prometheus/Grafana/TSDB/... can be used to setup a model monitoring platform since you're observing metrics whether they are from an ML service or a normal service. * Any service deployment tool can be used to deploy ML models, since they are services. * AirFlow/Dagster/... can be used to orchestrate model training, since training a model is basically a data engineering task. With that said, I still believe that there is space for ML-specific tools to be created. * Model Monitoring tools (ArizeAI is the only one I've used) can be tailored to be easily usable by ML Engineers without requiring DE knowledge. * Deploying models in production has some specifities: things like GPU support, adaptive batching, ... Those specifities can be implemented inside a model deployment tool. * Training orchestration is the only domain where I think there's truly no need for new tools. |
In the end the most successful platform I built was a custom orm I built around redis objects and queues and the most important part wasn't actually the fancy data processing platform, but actually the details of the container layers, the refactoring of the code to make it easily composable, releasable and easy for the scientists to play with but with enough guard rails so they wouldn't diverge too far from the structure.
It made incredibly fast at iterating. Of all the things I worked with Airflow was the one I was most hyped about from all the videos I had seen and which turn out to be the biggest mess of them all.