Hacker News
by korbonits 35 days ago
Have you considered using vLLM on top of Ray Serve (on EKS with KubeRay)? KubeRay makes Ray cluster-aware, and there could be some optimizations you could make, e.g. keeping that GPU fully utilized all the time :)

Thanks for the suggestion! Have you found that Ray Serve’s built-in autoscaling plays nicely with custom SLO-based concurrency limits, or do you usually let Ray handle the load balancing entirely?
To be honest, I don't know, because I have not hit many of those limits at what I would call "moderate" scale. So far, I have just provisioned enough pods to handle the traffic as-is, without using KubeRay. So k8s is handling the load balancing adequately at the moment, but Ray Serve is only pod-aware, not cluster-aware, for now.