Hacker News new | ask | show | jobs
by longbeachbass 355 days ago
Thanks for this! Learnt a lot.

Curious to understand how do we ensure that the same model instance gets requests from the same client/user? Since conversations are stateful and the model needs context from previous turns of the conversation.

Is this happening at the load balancer layer?

2 comments

It's either sticky sessions or an lb that keeps track of prior sequences and route to the instance with the largest match. https://docs.sglang.ai/router/router.html
They’re not stateful, you submit the entire history with every call. Caching of prompts etc makes it important for performance to have sticky sessions or smth at the load balancer layer
Yes, typically users send the newest user message and the full conversation history. These combined become the prompt.

Our API endpoint will try to route requests that has the same prefix to the same vLLM instance (similar to longest prefix matching in networking), and hopefully there are still some KV caches for part of the prompt there.