by lWaterboardCats 637 days ago
Would love to read an MLOps lessons-learned write-up or the approach you took, or any particular books you'd recommend that really hit the nail on the head.
1 comment

I don't have any books specific to MLOps, just because they weren't out when I was building that system. All of the good practices from building resilient distributed systems apply. Designing Data-Intensive Applications is always a great read.

Some things that were notable:

Model pipelines tend to be flakier than your other pipelines. They are much more complicated, and if you aren't careful it is easy to hit a resource limit or to have an unhandled exception accidentally kill a pipeline 10 hours into a run.

Avoiding those outright is obviously the best path, but that can be easier said than done.

One thing that we found really helpful was creating an error record in a database for every piece of data that failed to get processed, noting where in the pipeline it failed, etc. Retries and alerts were easy to tack on after that.
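
For anyone who wants something concrete, here is a rough sketch of that error-record idea. It is not the system described above: the table name `pipeline_errors`, its columns, and the helpers `run_stage` and `failed_record_ids` are all made up for illustration, and SQLite stands in for whatever database you actually use. The point is just that each failing record gets caught and logged with its stage instead of killing the run.

```python
import sqlite3
import traceback
from datetime import datetime, timezone


def init_error_table(conn):
    # Hypothetical schema: one row per record that failed, plus where it failed.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pipeline_errors (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            record_id TEXT NOT NULL,
            stage TEXT NOT NULL,
            error TEXT NOT NULL,
            failed_at TEXT NOT NULL,
            retried INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.commit()


def run_stage(conn, stage_name, stage_fn, records):
    """Apply stage_fn to each (record_id, payload) pair.

    A failure on one record is written to pipeline_errors and skipped,
    so a single bad input does not abort the whole stage.
    """
    succeeded = []
    for record_id, payload in records:
        try:
            succeeded.append((record_id, stage_fn(payload)))
        except Exception:
            conn.execute(
                "INSERT INTO pipeline_errors (record_id, stage, error, failed_at) "
                "VALUES (?, ?, ?, ?)",
                (
                    str(record_id),
                    stage_name,
                    traceback.format_exc(),
                    datetime.now(timezone.utc).isoformat(),
                ),
            )
            conn.commit()
    return succeeded


def failed_record_ids(conn, stage_name):
    # Records that failed in this stage and have not been retried yet.
    rows = conn.execute(
        "SELECT record_id FROM pipeline_errors "
        "WHERE stage = ? AND retried = 0 ORDER BY id",
        (stage_name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_error_table(conn)
    records = [("a", "1"), ("b", "oops"), ("c", "3")]
    print(run_stage(conn, "parse", int, records))
    print(failed_record_ids(conn, "parse"))
```

With a table like this, a retry job can just re-run the stage on whatever `failed_record_ids` returns, and an alert can be a periodic count of recent rows grouped by stage.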