As a DevOps engineer at an ML-based company who has worked for others in the past, these are my quick suggestions for production readiness.
DOs:
. If you are doing any kind of soft-realtime (i.e. not batch) inference by exposing a model on a request-response lifecycle, use TensorFlow Serving for concurrency reasons.
. Version your models and track their training. Use something like MLflow for that. Devise a versioning system that makes sense for your organization.
. If you are using Kubernetes in production, mount NFS in your containers to serve models. Do not download anything (from S3, for instance) at container start-up unless your models are small (<1 GB).
. If you have to write heavy preprocessing or postprocessing steps, eventually port them to a more efficient language than Python, such as Go or Rust.
DO NOTs:
. Do NOT make your ML engineers/researchers write anything above the model stack. Don't make them write queue management logic, web servers, etc. That's not their skill set, and they will write poorer, less performant code. Bring in a backend engineer EARLY.
. Do NOT mix and match if you are working with an asynchronous model, i.e. don't have a callback-based API and then a mix of queues and synchronous HTTP calls. Use queues EVERYWHERE.
. Do NOT start new projects in Python 2.7. In my experience, some ML engineers/researchers are quite attached to older versions of Python. Python 2.7 reaches end of support in 2020, and it makes no sense to start a project on it now.
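To make the TensorFlow Serving and versioning points a bit more concrete, here is a rough sketch of a request against TensorFlow Serving's REST predict endpoint, pinned to an explicit model version. The host `tf-serving.internal`, the model name `risk_model`, and version `3` are made up for illustration; port 8501 is the usual REST port, but check your own deployment. The request is only built, not sent.

```python
import json
import urllib.request


def build_predict_request(host, model_name, version, instances, port=8501):
    """Build (but do not send) a TensorFlow Serving REST predict request.

    Pinning the version in the URL keeps callers on a known model
    instead of whatever happens to be the latest one loaded.
    """
    url = (
        f"http://{host}:{port}/v1/models/{model_name}"
        f"/versions/{version}:predict"
    )
    # The REST API accepts a JSON body with an "instances" list (row format).
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, model name, and version, for illustration only.
req = build_predict_request(
    "tf-serving.internal", "risk_model", 3, [[0.1, 0.2, 0.3]]
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` (with a timeout) from whatever backend service sits in front of the model.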
As a scientist designing risk management systems, I also like to:
. avoid moving the data;
. bring the (ML/stats) code to the data;
. compute in memory (when possible) to reduce latency (network + disk);
. work on live data instead of copies that drift out-of-date; and
. write software to keep models up to date, because they drift over time too, and that's a major, operationally unnoticed, and extremely costly problem.
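As a rough illustration of the last point, a drift check can be as simple as comparing recent prediction errors against the errors seen at validation time. This is only a sketch of one possible rule, not a recommendation of a specific method: the function name `needs_retraining` and the `tolerance` factor of 1.5 are assumptions made up for the example.

```python
from statistics import mean


def needs_retraining(baseline_errors, recent_errors, tolerance=1.5):
    """Return True when the recent mean absolute error exceeds
    `tolerance` times the baseline mean absolute error.

    `tolerance` is an arbitrary illustrative threshold; a real system
    would tune it (or use a proper statistical test) per model.
    """
    if not baseline_errors or not recent_errors:
        raise ValueError("both error samples must be non-empty")
    baseline = mean(abs(e) for e in baseline_errors)
    recent = mean(abs(e) for e in recent_errors)
    return recent > tolerance * baseline


# Errors = prediction minus observed value, collected at validation
# time (baseline) and from recent live traffic (recent).
if needs_retraining([1.0, -1.0, 1.0], [2.0, -2.0, 2.0]):
    print("model has drifted: schedule a retrain")
```

Running something like this on a schedule, against live data rather than a stale copy, is what turns drift from an unnoticed problem into an alert.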
I'm not yet into TensorFlow or MLflow, but I use R, JS, and Postgres, thereby relying on open-source ecosystems (and packages) that are:
. as standard as possible;
. well maintained;
. expected to be supported for a long time; and
. as light on dependencies as possible.
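For what "bring the code to the data" can look like with a SQL database, here is a small sketch that pushes the aggregation into the database instead of pulling rows out. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in so the example runs without a server; it is not Postgres, and the `exposures` table and its columns are invented for the example.

```python
import sqlite3

# In-memory stand-in for a real database; table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exposures (desk TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO exposures VALUES (?, ?)",
    [("rates", 120.0), ("rates", 80.0), ("fx", 50.0)],
)


def exposure_by_desk(connection):
    """Aggregate inside the database so that only one summary row
    per desk crosses the wire, rather than every raw record."""
    rows = connection.execute(
        "SELECT desk, SUM(amount), AVG(amount) "
        "FROM exposures GROUP BY desk ORDER BY desk"
    ).fetchall()
    return {desk: {"total": total, "mean": avg} for desk, total, avg in rows}


print(exposure_by_desk(conn))
```

The same idea scales up: the heavier the statistic, the more it pays to run it next to the live data rather than on an exported copy.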