Hacker News new | ask | show | jobs
by hhh 362 days ago
They’re not stateful, you submit the entire history with every call. Caching of prompts etc makes it important for performance to have sticky sessions or smth at the load balancer layer
1 comments

Yes, typically users send the newest user message and the full conversation history. These combined become the prompt.

Our API endpoint will try to route requests that has the same prefix to the same vLLM instance (similar to longest prefix matching in networking), and hopefully there are still some KV caches for part of the prompt there.