by saltedonion 1458 days ago
Apologies for the noob question, but how would FastAPI/Flask know that the job has been successfully completed? Would the worker have to persist ML inference results somewhere and the FastAPI server poll it periodically?

This is a very good question, with a lot of different answers depending on your use-case.

One approach is to translate the synchronous call into an async call plus polling on the webapp side. You push a message onto a queue, with the name of the callback queue in the message body. But that gives you problems when you want to deploy a new version of your webapp - existing connections will be disrupted and the state lost.
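
A rough sketch of the queue-plus-callback idea, using in-process queues as stand-ins for a real broker. The names `work_queue`, `reply_queues`, `reply_to` and `call` are made up for illustration, and the "model" is just a placeholder:

```python
import queue
import threading
import uuid

# Assumption: in-process queues stand in for a real message broker.
work_queue = queue.Queue()
reply_queues = {}  # hypothetical registry: callback queue name -> queue


def worker():
    """Worker side: process a message and reply on the queue it names."""
    while True:
        message = work_queue.get()
        result = message["payload"].upper()  # placeholder for the model call
        reply_queue = reply_queues.get(message["reply_to"])
        if reply_queue is not None:  # the caller may already have given up
            reply_queue.put(
                {"request_id": message["request_id"], "result": result}
            )


def call(payload, timeout=2.0):
    """Webapp side: looks synchronous to the caller, async underneath."""
    reply_to = f"reply-{uuid.uuid4()}"
    reply_queues[reply_to] = queue.Queue()
    message = {
        "request_id": str(uuid.uuid4()),
        "reply_to": reply_to,  # callback queue travels in the message body
        "payload": payload,
    }
    work_queue.put(message)
    try:
        # Raises queue.Empty if no reply arrives within the timeout.
        reply = reply_queues[reply_to].get(timeout=timeout)
    finally:
        del reply_queues[reply_to]
    return reply["result"]


threading.Thread(target=worker, daemon=True).start()
```

Note that the reply queues live in the webapp's memory here, which is exactly the state that disappears when the webapp is redeployed.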

Since you need to deal with retries anyway, you can move the polling logic into the client itself. It gets the request ID in the initial response and then asks the service for the results.
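
Client-side polling could look roughly like this. It is a sketch only: the in-memory `jobs` queue and `results` dict stand in for a real broker and result store, and `submit`, `get_result` and `poll` are hypothetical names for what would be two HTTP endpoints and a client loop:

```python
import queue
import threading
import time
import uuid

# Assumption: in-memory stand-ins for a broker and a shared result store.
jobs = queue.Queue()
results = {}
results_lock = threading.Lock()


def submit(payload):
    """Service side: enqueue the job and return a request id right away."""
    request_id = str(uuid.uuid4())
    jobs.put((request_id, payload))
    return request_id


def get_result(request_id):
    """Service side: report whether the result has been persisted yet."""
    with results_lock:
        if request_id in results:
            return {"status": "done", "result": results[request_id]}
    return {"status": "pending"}


def worker():
    """Worker side: take jobs off the queue and persist each result."""
    while True:
        request_id, payload = jobs.get()
        output = sum(payload)  # placeholder for the real model call
        with results_lock:
            results[request_id] = output


def poll(request_id, timeout=2.0, interval=0.01):
    """Client side: ask for the result until it is ready or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = get_result(request_id)
        if response["status"] == "done":
            return response["result"]
        time.sleep(interval)
    raise TimeoutError(f"no result for {request_id} within {timeout}s")


threading.Thread(target=worker, daemon=True).start()
```

Usage would be `request_id = submit([1, 2, 3])` followed by `poll(request_id)`. The `interval` and `timeout` values are arbitrary; over HTTP each poll is a full round trip, so the interval is a trade-off between latency and load.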

You see, the solution can vary wildly depending on your scalability, durability and resiliency requirements. And on your budget. It's not wild to expect the response to be big, so you might want to upload it to S3. You might use websockets, too. Technology gives you a lot of options here, at different levels of complexity and scalability ;)

Thank you for the response. Would this pattern be appropriate for something like ML inference if I’m looking for sub 200ms response time and the results objects are small? The polling interval would need to be very short, and would that become an issue?
It depends on how long it takes to query your model. If that takes a significant portion of your time budget, you could just put a reverse proxy in front of your models and let it route the traffic. I would architect your workers around homogeneous traffic, since that makes capacity easier to calculate, and route the traffic by path on the proxy.
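
To make the path-based routing concrete, here is a toy version of the decision the proxy would make. In practice this lives in the proxy's own config rather than in Python; the path prefixes and backend addresses below are invented for illustration:

```python
import itertools

# Hypothetical path prefixes and worker addresses, for illustration only.
# Each pool serves one kind of (homogeneous) traffic.
POOLS = {
    "/models/sentiment": ["10.0.0.1:8000", "10.0.0.2:8000"],
    "/models/ranking": ["10.0.1.1:8000"],
}

# One round-robin iterator per pool.
_cycles = {prefix: itertools.cycle(backends) for prefix, backends in POOLS.items()}


def route(path):
    """Pick the next backend for the pool whose prefix matches the path."""
    for prefix, backends in _cycles.items():
        if path == prefix or path.startswith(prefix + "/"):
            return next(backends)
    return None  # no pool serves this path
```

Because each pool only sees one kind of request, its capacity can be estimated from that model's latency alone.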

In that case you trade away some safety, but the setup stays simple. But if querying the model takes less than 100ms, I would question whether it's CPU/memory intensive at all. Remember that you still need to allocate and populate the memory, and Python generally doesn't return freed memory to the OS.

Might be a good application of WebSockets or server-sent events with a job queue on the backend. That or polling on the client-side.
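
For the server-sent events option, a minimal sketch of the server side might look like the following. It assumes the worker pushes `(event, data)` tuples onto a per-job queue; `sse_frame`, `stream_job_events` and the event names `progress` and `done` are made up for this example:

```python
import json
import queue


def sse_frame(event, data):
    """Format one server-sent event: an event name plus a JSON data line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_job_events(updates):
    """Yield SSE frames from a queue of (event, data) updates.

    Assumption: the worker signals completion with a "done" event.
    """
    while True:
        event, data = updates.get()  # blocks until the worker reports
        yield sse_frame(event, data)
        if event == "done":
            break
```

In FastAPI, a generator like this could probably be handed to a `StreamingResponse` with `media_type="text/event-stream"`, so the client gets the result pushed to it instead of polling.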