I don't have any books specific to MLOps, just because they weren't out when I was building that system. All of the good practices from building resilient distributed systems apply. Designing Data-Intensive Applications is always a great read.
Some things that were notable:
Model pipelines tend to be flakier than your other pipelines. They are much more complicated, and if you aren't careful it's easy to hit a resource limit or have an unhandled exception kill a pipeline 10 hours into a run.
Avoiding those outright is obviously the best path, but that can be easier said than done.
One thing we found really helpful was creating an error record in a database for every piece of data that failed to get processed, noting where in the pipeline it failed, etc. Retries and alerts were easy to tack on after that.
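A rough sketch of that error-record idea, assuming SQLite and a made-up table layout. The names pipeline_errors, run_stage, and unresolved_failures are invented for illustration, not what we actually ran. Note that it only catches Python exceptions; a process killed for exceeding a resource limit would need handling outside the process.

```python
import sqlite3
import traceback
from datetime import datetime, timezone


def init_error_table(conn):
    # Hypothetical schema: one row per item that failed, plus where it failed.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pipeline_errors (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               item_id TEXT NOT NULL,
               stage TEXT NOT NULL,
               error_type TEXT NOT NULL,
               message TEXT NOT NULL,
               trace TEXT NOT NULL,
               failed_at TEXT NOT NULL,
               resolved INTEGER NOT NULL DEFAULT 0
           )"""
    )
    conn.commit()


def run_stage(conn, stage, func, items):
    """Apply func to each item; record failures instead of aborting the run."""
    results = {}
    for item_id, payload in items.items():
        try:
            results[item_id] = func(payload)
        except Exception as exc:
            # Record the failure and keep going with the remaining items.
            conn.execute(
                "INSERT INTO pipeline_errors "
                "(item_id, stage, error_type, message, trace, failed_at) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (
                    item_id,
                    stage,
                    type(exc).__name__,
                    str(exc),
                    traceback.format_exc(),
                    datetime.now(timezone.utc).isoformat(),
                ),
            )
            conn.commit()
    return results


def unresolved_failures(conn, stage):
    """Item ids that failed in this stage and have not been marked resolved."""
    rows = conn.execute(
        "SELECT item_id FROM pipeline_errors "
        "WHERE stage = ? AND resolved = 0 ORDER BY id",
        (stage,),
    ).fetchall()
    return [row[0] for row in rows]
```

With a table like this, a retry job is just a query for unresolved rows in a stage followed by another run_stage call over those items, and an alert is a count of recent rows compared against a threshold.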