| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by longbeachbass 355 days ago

Thanks for this! Learnt a lot.

Curious to understand how do we ensure that the same model instance gets requests from the same client/user? Since conversations are stateful and the model needs context from previous turns of the conversation.

Is this happening at the load balancer layer?

2 comments

cyanf 355 days ago

It's either sticky sessions or an lb that keeps track of prior sequences and route to the instance with the largest match. https://docs.sglang.ai/router/router.html

link

hhh 355 days ago

They’re not stateful, you submit the entire history with every call. Caching of prompts etc makes it important for performance to have sticky sessions or smth at the load balancer layer

link

0xjunhao 354 days ago

Yes, typically users send the newest user message and the full conversation history. These combined become the prompt.

Our API endpoint will try to route requests that has the same prefix to the same vLLM instance (similar to longest prefix matching in networking), and hopefully there are still some KV caches for part of the prompt there.

link