Hacker News
by ra-mos 1640 days ago
A lot of this depends on what “prod” means. If it means production applications, you are largely just retrofitting DevOps processes to incorporate model APIs. Nothing really changes in how you deploy and monitor your systems.

Production models, on the other hand, probably means some advanced domain-specific use case. The problem with all these MLOps platforms and model-serving services is that they are abstractions for general use cases. Call it the capitalism effect; I wouldn’t buy into these quite yet.

To serve production models, you’re going to need to get low-level, especially if performance is a feature. You’ll need to figure out architecture designs that work best for your use case.

For example, I serve NLP models to augment a complex enterprise search engine. These are large TensorFlow models with large embedding spaces.

We use Kubernetes, SSD optimizations for fast embedding retrieval, and custom-compiled TensorFlow container images. We then have a sort of demux architecture that custom-batches every request into sub-requests (100s-1000s), which fan out for inference across Kubernetes. Almost every API is in Go, and we replaced Python with Go to run the models. We use FlatBuffers for every request so that we pay essentially no serialization cost, since we turn every user request into 100s of sub-requests that come back as a single response in ~1s. Some CPU optimizations along the way, and now we’re happy with the current prod infrastructure.
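To make the demux idea concrete, here is a minimal sketch of the fan-out/fan-in step. It is written in Python for brevity (the setup described above runs on Go), and everything in it is an assumption for illustration: `split_into_batches`, `fan_out`, the batch size, and `fake_infer`, which stands in for a real call to an inference service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split one user request into fixed-size sub-requests."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def fan_out(
    items: Sequence[str],
    infer_batch: Callable[[List[str]], List[float]],
    batch_size: int = 64,
    max_workers: int = 8,
) -> List[float]:
    """Run infer_batch on every sub-request concurrently, then merge the
    results back into one response in the original item order."""
    batches = split_into_batches(items, batch_size)
    if not batches:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so merging is a flatten.
        results = list(pool.map(infer_batch, batches))
    return [score for batch_scores in results for score in batch_scores]


def fake_infer(batch: List[str]) -> List[float]:
    """Hypothetical stand-in for a remote inference call: scores by ID length."""
    return [float(len(doc_id)) for doc_id in batch]
```

Calling `fan_out(doc_ids, fake_infer, batch_size=64)` turns one request into several concurrent sub-requests and returns a single flat list of scores. In a real deployment the workers would be network calls to model replicas rather than threads in one process.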

AFAIK, no MLOps tool could’ve done this for us. The major new thing we do is capture every metric we can and incorporate UX into ML research and retraining.

1 comments

Sounds awesome! Looks like you invested a lot; e.g., inference batching is not easy at all. A few follow-up questions:
- Do you really retrieve embeddings for every batch? Doesn't it make sense to keep them in memory (they shouldn't be more than a few gigs)?
- Do you have, or plan to add, an inference feedback loop for retraining? Or do you count that as a metric as well?

Thanks in advance for the answer!

Hey,

1. Our models act as building blocks for various feature APIs/algos, but embeddings are required for all tasks. We have one major task in which a user request sends business items (think documents / docIds), anywhere from 1 to 10,000s, and there we have to retrieve the embeddings quickly. We also do a deep retrieval (basically ANN), which requires no lookup.

2. The first batch of embeddings was 4.2TB; the latest is down to ~400GB. I'm not sure SSD plus an embedded DB (LevelDB) is still the best option, but it still works fine. Embeddings grow weekly through business process generation (we do a big write once a week), but also through user processing. The latter we actually store in Mongo, since they work best there: they are usually user-specific and generated ad hoc, so it doesn’t make sense to re-shard the LevelDB embeddings.

3. We use active/passive feedback mechanisms. Think thumbs up/down for active, and contextual analysis for passive (did good things happen after AI use).
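The two-store embedding lookup from points 1 and 2 could be sketched roughly as follows. This is only an illustration in Python: plain dicts stand in for the LevelDB bulk store and the Mongo user store, the name `lookup_embeddings` is hypothetical, and checking the bulk store before the user store is an assumption, not something stated above.

```python
from typing import Dict, List, Mapping, Sequence, Tuple

Vector = List[float]


def lookup_embeddings(
    doc_ids: Sequence[str],
    bulk_store: Mapping[str, Vector],
    user_store: Mapping[str, Vector],
) -> Tuple[Dict[str, Vector], List[str]]:
    """Resolve each docId to an embedding.

    bulk_store stands in for the weekly-written embedded DB,
    user_store for the ad hoc, user-specific collection.
    Returns (found embeddings, docIds with no embedding anywhere).
    """
    found: Dict[str, Vector] = {}
    missing: List[str] = []
    for doc_id in doc_ids:
        if doc_id in bulk_store:      # assumption: bulk store is checked first
            found[doc_id] = bulk_store[doc_id]
        elif doc_id in user_store:    # fall back to user-generated embeddings
            found[doc_id] = user_store[doc_id]
        else:
            missing.append(doc_id)
    return found, missing
```

Returning the missing docIds separately lets the caller decide whether to skip them or generate embeddings on the fly.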

#3 / your last question is the hardest. We have a small data science team figuring out the best way to build fine-tuning sets, but production model training is vastly different and more reliable. This is the classic stability/plasticity problem. We have yet to successfully fine-tune the model with feedback data, but it has informed how we train the model better. The best-case scenario will likely be ad hoc, user-specific mechanisms that change results in real time based on the current context collected (think the TikTok algo, which knows what you’re in the mood for but also knows how to bounce out or keep things varied to hold your attention).
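As a rough illustration of turning the active/passive signals from point 3 into rows for a fine-tuning set, here is a small Python sketch. The `FeedbackEvent` fields, the function names, and the rule that an explicit vote overrides the passive signal are all assumptions made for the example, not a description of the actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FeedbackEvent:
    query: str
    result_id: str
    thumbs_up: Optional[bool] = None          # active signal; None if no vote
    positive_followup: Optional[bool] = None  # passive signal; None if unknown


def to_label(event: FeedbackEvent) -> Optional[int]:
    """Return 1 (good), 0 (bad), or None when there is no usable signal.

    Assumption: an explicit vote overrides the passive signal.
    """
    if event.thumbs_up is not None:
        return int(event.thumbs_up)
    if event.positive_followup is not None:
        return int(event.positive_followup)
    return None


def build_finetuning_set(events: List[FeedbackEvent]) -> List[Tuple[str, str, int]]:
    """Keep only events that carry a signal, as (query, result_id, label) rows."""
    rows = []
    for event in events:
        label = to_label(event)
        if label is not None:
            rows.append((event.query, event.result_id, label))
    return rows
```

How to weight active against passive signals is exactly the hard part mentioned above; a simple override rule like this one is only a starting point.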