Hacker News
by saltedonion 1460 days ago
Thank you for the response. Would this pattern be appropriate for something like ML inference if I’m looking for sub-200 ms response times and the result objects are small? The polling interval would need to be very short; would that become an issue?
1 comment

It depends on how long it takes to query your model. If that takes a significant portion of your time budget, you could just put a reverse proxy in front of your models and let it route the traffic. I would architect your workers around homogeneous traffic, since that makes capacity easier to calculate, and route the traffic by path on the proxy.
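
To make the routing and capacity point concrete, here is a rough Python sketch, not a real proxy config. The path prefixes, backend addresses, the 70% utilization target, and the one-request-per-worker assumption are all made up for illustration; in practice the routing part would live in whatever reverse proxy you use.

```python
import math
from itertools import cycle

# Hypothetical pools: one path prefix per model, each served by identical workers.
POOLS = {
    "/models/sentiment": ["10.0.0.1:8000", "10.0.0.2:8000"],
    "/models/ranking": ["10.0.0.3:8000"],
}
_round_robin = {prefix: cycle(backends) for prefix, backends in POOLS.items()}


def pick_backend(path):
    """Return the next backend for the longest matching path prefix, or None."""
    matches = [p for p in POOLS if path == p or path.startswith(p + "/")]
    if not matches:
        return None
    return next(_round_robin[max(matches, key=len)])


def workers_needed(requests_per_second, seconds_per_request, headroom=0.7):
    """Estimate worker count, assuming each worker serves one request at a time."""
    # Average number of workers busy at any moment (arrival rate x service time).
    busy_workers = requests_per_second * seconds_per_request
    # Divide by the target utilization so the pool is not run at 100%.
    return math.ceil(busy_workers / headroom)
```

Because every request on a given path costs roughly the same, one service time per pool is enough for the estimate: 50 requests per second at 100 ms each with a 70% utilization target works out to `workers_needed(50, 0.1) == 8`.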

In that case you trade off some safety, but the setup stays simple. But if querying the model takes less than 100 ms, I would question whether that's CPU- or memory-intensive at all. Remember that you still need to allocate and populate the memory, and Python doesn't free the memory back to the OS.