| Having put many models into production in a near-real-time environment (ad servers that need predictions in < 10 ms), I would say a large portion of what you need depends on your production requirements. The system I work on is very high volume (100k requests/sec) and we have massive amounts of data, which complicates everything.

First, I would highly recommend wrapping your ML models in some kind of microservice. Depending on your production requirements, and if the ML is in Python, a fairly simple Flask/Sanic web server should be sufficient. This is great because you can leave all your feature transformation code as is in Python.

If your production environment has very low latency requirements, you have your work cut out for you. You'll most likely have to rewrite all your transformation code in a faster language like Go or Java. You might also need to reimplement the inference code to get the speed you need. This adds considerable time and a ton of surface area for insidious bugs: the ML will still make predictions, but they will be wrong or very slightly wrong.

Because I'm working with large amounts of data and my source of truth is Parquet logs in S3, the pipelines start with Spark. We do as much data wrangling as possible in Spark to get things down to a manageable size and create our train/dev/test sets. This data gets uploaded to S3. The models are then trained on those datasets on EC2 instances using Pandas & sklearn.

When everything is fully automated, the Spark job pushes a message onto an SQS queue with the S3 path of the fresh dataset. An EC2 instance polls that queue, pulls down the data, and trains a new model. The final result of training in my case is a text or binary model file that goes back up to S3. Our prediction microservice polls an S3 bucket, pulls down any updated model files, and swaps out the running models.

Tips:
1. Instrument everything! Hopefully you have something like Graphite/Datadog/Prometheus in place already, but you'll want metrics on your predictions.
2. Exception tracking on everything, especially anything in your model creation pipeline. Use Sentry or something like it.
3. Try and keep everything as simple as possible.
|
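For anyone trying to picture the last step described above (the prediction service polling for updated model files and swapping out the running model), here is a minimal Python sketch. It is not the commenter's implementation, and a lot is assumed: a local file stands in for the S3 object, a content hash stands in for whatever change marker you would actually compare, and a JSON dict of weights stands in for the real model file. `publish_model` and `ModelHolder` are hypothetical names invented for the illustration.

```python
import hashlib
import json
import os
import tempfile
import threading


def publish_model(path, weights):
    """Write the model file atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(weights, f)
    os.replace(tmp, path)  # atomic rename over the old file


class ModelHolder:
    """Holds the live model and swaps it when the file content changes."""

    def __init__(self, path):
        self._path = path  # assumption: local path standing in for the S3 object
        self._lock = threading.Lock()
        self._weights = None
        self._digest = None  # hash of the model file currently loaded

    def poll_once(self):
        """Reload if the file changed; return True when a swap happened."""
        try:
            with open(self._path, "rb") as f:
                raw = f.read()
        except FileNotFoundError:
            return False
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._digest:
            return False
        weights = json.loads(raw)  # parse fully before touching the live model
        with self._lock:
            self._weights, self._digest = weights, digest
        return True

    def predict(self, features):
        """Toy linear score; a real service would call its actual model here."""
        with self._lock:
            weights = self._weights  # grab a reference, then work outside the lock
        if weights is None:
            raise RuntimeError("no model loaded yet")
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
```

In a running service, `poll_once` would be called on a timer from a background thread while request handlers call `predict`. The new model is parsed completely before the swap, so a bad file raises during loading and the old model keeps serving.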
One piece did give me a bit of a surprise:
I've never had the super-low-latency requirements you have, but as you point out, this seems amazingly error-prone. I'd love to hear anything else you can share about the cost-benefit analysis you do before deciding to go this route, and whether there are any tools or languages you've had a better time with than others for this.
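One way to make the "wrong or very slightly wrong" failure mode visible is a parity check: run the same sample rows through the original Python transform and through the rewritten one, then compare the features within a tolerance. The sketch below is only an illustration under assumptions. Both transforms are hypothetical stand-ins written in Python, the feature names are made up, and in practice the second function would wrap the output of the Go or Java port.

```python
import math


def reference_transform(row):
    """Stand-in for the original Python feature transformation (hypothetical)."""
    return {
        "log_bid": math.log1p(row["bid"]),
        "hour_sin": math.sin(2 * math.pi * row["hour"] / 24),
    }


def ported_transform(row):
    """Stand-in for the rewrite; a real check would call the Go/Java port here."""
    return {
        "log_bid": math.log(1.0 + row["bid"]),
        "hour_sin": math.sin(2 * math.pi * row["hour"] / 24),
    }


def find_mismatches(rows, reference, candidate, rel_tol=1e-9, abs_tol=1e-12):
    """Return (row index, feature, expected, actual) for every disagreement."""
    mismatches = []
    for i, row in enumerate(rows):
        expected, actual = reference(row), candidate(row)
        for name in sorted(set(expected) | set(actual)):
            if name not in expected or name not in actual:
                # A feature present on only one side is always a mismatch.
                mismatches.append((i, name, expected.get(name), actual.get(name)))
            elif not math.isclose(expected[name], actual[name],
                                  rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append((i, name, expected[name], actual[name]))
    return mismatches


sample_rows = [{"bid": 0.5, "hour": 3}, {"bid": 12.0, "hour": 23}]
problems = find_mismatches(sample_rows, reference_transform, ported_transform)
```

An empty `problems` list means the two implementations agree on the sample within tolerance. It says nothing about inputs that are not in the sample, so the rows should be drawn from real traffic where possible.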