| HN Mirror

My experience is bulk of the problem is insufficient monitoring. ML systems need heavy monitoring and should be sending lots of metrics to stuff like prometheus/grafana. There should also be validation/consistency checks for all data pipeline/feature transformations. And you should strongly avoid duplicating logic for stuff like feature preprocessing. I've seen people implement "same" feature preprocessing pipeline twice (one python, one java) and it is so common to find edge case bugs for a long time especially when these bugs only slightly impact model behavior.

Another issue is proliferation of data pipelines. The more distinct pipelines you have, the more painful they become to monitor. It is much better to minimize pipelines and do views on a small number. I think proliferations of models is a similar issue. It is often easier to build 4 models instead of 1 multi-task model, but monitoring/operational tasks grow more and more painful as you manage more models.