| HN Mirror

by volker48 3028 days ago

Having put many models into production in a near-real-time environment (ad servers that need predictions in < 10 ms), I would say a large portion of what you need depends on your production requirements. The system I work on is very high volume (100k requests/sec) and we have massive amounts of data, which complicates everything.

First, I would highly recommend wrapping your ML models in some kind of microservice. Depending on your production requirements, and if the ML is in Python, a fairly simple Flask/Sanic web server should be sufficient. This is great because you can leave all your feature transformation code as is in Python.
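As a rough illustration of the microservice idea above, here is a minimal sketch that uses only the standard library's http.server as a stand-in for Flask/Sanic. The feature names (`bid_floor`, `device`), the `DummyModel` class, and the port are all hypothetical; the point is only that the Python transform code runs unchanged inside the request handler.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def transform(raw):
    """Hypothetical feature transformation, shared with the training code."""
    return [float(raw["bid_floor"]), 1.0 if raw["device"] == "mobile" else 0.0]


class DummyModel:
    """Stand-in for a trained model; a real service would load one from disk."""

    def predict(self, features):
        return 0.1 * features[0] + 0.5 * features[1]


MODEL = DummyModel()


def handle_request(body):
    """Parse a JSON request body, transform it, score it, return JSON bytes."""
    raw = json.loads(body)
    score = MODEL.predict(transform(raw))
    return json.dumps({"score": score}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the blocking server loop (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server; keeping `handle_request` as a plain function makes the scoring path testable without any network setup.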

If your production environment has very low latency requirements, you will have your work cut out for you. You'll most likely have to rewrite all your transformation code in a faster language like Go or Java. You might also need to implement the inference code as well to get the speed you need. This adds considerable time and a lot of surface area for insidious bugs. The ML will still make predictions, but they will be wrong or very slightly wrong.

Because I'm working with large amounts of data and my source of truth is Parquet logs in S3, the pipelines start with Spark. We do as much data wrangling as possible in Spark to get things into a manageable size to create our train/dev/test sets. This data gets uploaded to S3.

Models are then trained on these datasets on EC2 instances using Pandas & sklearn. When everything is fully automated, the Spark job pushes a message onto an SQS queue with the S3 path of the fresh dataset. An EC2 instance polls that queue, pulls down the data, and trains a new model.

The final result of training in my case is a text or binary model file that goes back up to S3. Our prediction microservice polls an S3 bucket, pulls down any updated model files, and swaps out the running models.
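The poll-and-swap step might look roughly like the sketch below. It is not the author's code: `ModelHolder`, `maybe_swap`, and the `load_model` callable are hypothetical names, and the "version" is assumed to be whatever the poller uses to detect a change (for an S3 object, possibly its ETag or last-modified time).

```python
import threading


class ModelHolder:
    """Holds the live model and lets a background poller swap it safely."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def get(self):
        """Return the model currently serving predictions."""
        with self._lock:
            return self._model

    def maybe_swap(self, latest_version, load_model):
        """Install a new model only if the observed version changed.

        `load_model` is a hypothetical callable that downloads and parses
        the model file. It runs outside the lock so request threads are
        not blocked while the new model loads.
        """
        with self._lock:
            if latest_version == self._version:
                return False
        new_model = load_model()
        with self._lock:
            self._model = new_model
            self._version = latest_version
        return True
```

Request handlers call `get()` on every prediction, so a swap takes effect on the next request without restarting the service.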

Tips:

  1. Instrument everything! Hopefully you have something like graphite/datadog/prometheus in place already, but you'll want metrics on your predictions.
  2. Exception tracking on everything, especially anything in your model creation pipeline. Sentry or something like that.
  3. Try and keep everything as simple as possible.

1 comments

claytonjy 3028 days ago

Thanks for this; I really appreciate the detail here. There seems to be a lack of these kinds of explanations around.

One piece did give me a bit of surprise:

  You might also need to implement the inference code as well to get the speed you need

I've never had the super-low-latency requirements you have, but as you point out this seems amazingly error-prone. I'd love to hear anything else you can share about the cost-benefit analysis you do before deciding to go this route, and whether there are any tools or languages you've had a better time with than others for this.

chewxy 3028 days ago

I have similar experience to GP, which is how I ended up writing Gorgonia (https://github.com/gorgonia/gorgonia). Serialization of models is easy: .npy files are a very good format, though I know a number of people who will disagree. I find they typically prefer pb as a serialization format, which I think is better over the wire than for storage.

Also like GP, I used to work in advertising. The low latency bits are because RTB servers want you to respond within a certain amount of time. When I was in advertising, our solution was to precalculate a whole bunch of things and throw them into redis. It was then a simple lookup of the hash of a vector (for bidding related stuff).
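A minimal sketch of that precompute-then-lookup pattern, with a plain dict standing in for redis. The function names, the SHA-1 key scheme, and the default value are assumptions for illustration, not a description of the system mentioned above.

```python
import hashlib
import struct


def vector_key(vector):
    """Hash a feature vector into a fixed-length string key."""
    packed = struct.pack(f"<{len(vector)}d", *vector)
    return hashlib.sha1(packed).hexdigest()


def precompute(vectors, score_fn):
    """Offline step: score every expected vector and store it by hash."""
    return {vector_key(v): score_fn(v) for v in vectors}


def lookup(store, vector, default=0.0):
    """Online step: one key lookup, no model evaluation at request time."""
    return store.get(vector_key(vector), default)
```

This only works when the set of possible vectors is small enough to enumerate ahead of time; anything outside that set falls back to the default.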

volker48 3028 days ago

Yeah, the requirements are pretty different than most Data Science teams, especially the very low latency requirements.

The constraints force us to use simple models like linear regression and logistic regression some of the time, or at least as a version 1. The inference here is straightforward: multiply and add, then take the sigmoid if doing logistic regression.
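For reference, the "multiply and add, then sigmoid" inference can be sketched in a few lines. The function names are illustrative, and the weights and bias are assumed to come from a model trained elsewhere.

```python
import math


def linear_score(weights, bias, features):
    """Dot product of weights and features plus the bias term."""
    return bias + sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_logistic(weights, bias, features):
    """Logistic regression: sigmoid of the linear score."""
    return sigmoid(linear_score(weights, bias, features))
```

Linear regression is just `linear_score` on its own, which is part of why these models are easy to port to another language.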

What we tried to do initially was integrate with C/C++ APIs where possible. We ran into some issues with speed and bugs doing this though, which is why we wrote the inference ourselves. Where we had issues was calling the XGBoost C API from Go. It was extra overhead and too slow. In our benchmarks our implementation in pure Go was many times faster than calling the C API. We also found the multithreaded version to be slower than the single-threaded one. We found this to be true when calling XGBoost from Java and from Go. It was also true in our own inference implementation: it was always faster to walk the trees in a single goroutine than to create some number of worker goroutines to walk the trees in parallel.

We were very careful implementing the inference ourselves to make sure the predictions matched. What we did to verify this was create a few toy datasets of about 100 rows with sklearn's make_classification function. We then trained a model using the reference implementation, saved the predictions and the model. We then loaded this model into our implementation and made predictions on the same dataset. We wrote unit tests to compare the predictions and make sure they are the same within some delta. We were able to get our implementation to be within 1e-7 of the reference implementation, in this specific case XGBoost. It was actually more time consuming to deal with parsing the inconsistent JSON model output of XGBoost than it was to implement the GBDT inference algorithm. We also had to make a slight change to the XGBoost code to write out floats to 18 decimal places when writing out the JSON model in order to get the two implementations to match.
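The comparison step described above could be expressed as a small helper like the one below. This is a sketch rather than the actual test suite: the function names are hypothetical, and in practice `reference` would hold the saved predictions from the reference implementation and `candidate` the output of the reimplementation on the same rows.

```python
def max_abs_diff(reference, candidate):
    """Largest absolute difference between two equal-length prediction lists."""
    if len(reference) != len(candidate):
        raise ValueError("prediction lists differ in length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def assert_predictions_match(reference, candidate, delta=1e-7):
    """Fail if any prediction differs from the reference by more than delta."""
    worst = max_abs_diff(reference, candidate)
    assert worst <= delta, f"max difference {worst:.3e} exceeds {delta:.0e}"
```

Reporting the worst-case difference rather than a plain pass/fail makes it easier to spot precision loss, such as floats truncated when the model file is written out.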

claytonjy 3028 days ago

This is all SO fascinating to me. Multiple threads slowing stuff down & 18 decimal places being relevant stick out as surprising.

Part of me is thankful to not have these problems, while another part thinks it'd be a lot of fun to do this kind of last-mile engineering.

volker48 3028 days ago

Yeah, I had a very hard time believing that the multithreaded approach would be slower. It's so counterintuitive, since at first blush it seems that walking N trees is an embarrassingly parallel problem. I tested up to 1000 trees and single threaded was still faster. I'm sure at some point the multithreaded approach will win out, but it's beyond the number of trees and max depth we are using.

claytonjy 3028 days ago

I've half-convinced myself it's because we're talking about GBMs and not Random Forests (where my mind goes first). One of the smart things about XGBoost is parallelizing training by multithreading the variable selection at each node, but that doesn't apply to inference; I imagine you gotta predict trees sequentially since each takes the previous output as an input? Now I wonder what those extra threads were even doing...

volker48 3027 days ago

Each tree can be traversed independently. Each tree is traversed until a leaf node is reached. Those leaf values are summed for all the trees. The sum of all the leaves plus a base score is then returned. In the case of binary classification that sum is passed through a sigmoid function. For regression the sum is returned as is.
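The traversal described above can be sketched as follows. The nested-dict tree layout (`feature`, `threshold`, `left`, `right`, `leaf`) is a simplified, hypothetical format, not the XGBoost JSON dump, and the sketch assumes the base score is already in margin space and that no feature values are missing.

```python
import math


def walk_tree(node, features):
    """Follow splits from the root until a leaf is reached; return its value."""
    while "leaf" not in node:
        if features[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


def predict_margin(trees, features, base_score=0.0):
    """Sum of one leaf value per tree, plus the base score (regression output)."""
    return base_score + sum(walk_tree(tree, features) for tree in trees)


def predict_proba(trees, features, base_score=0.0):
    """Binary classification: pass the summed margin through a sigmoid."""
    z = predict_margin(trees, features, base_score)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Each call to `walk_tree` depends only on the feature vector, never on another tree's output, which is why the trees can be traversed in any order.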

chewxy 3028 days ago

You wrote XGBoost in Go? I've been looking for that and dreading writing one.

volker48 3028 days ago

I wrote the inference step of XGBoost in Go. It will make predictions after loading in an XGBoost JSON model. Writing the training portion of XGBoost would be much harder.
