| HN Mirror

	by korbonits 35 days ago
	Have you considered running vLLM on top of Ray Serve (on EKS with KubeRay)? KubeRay makes Ray cluster-aware, so there may be some optimizations available, e.g. keeping that GPU fully utilized all the time :)

nicoinstrument 35 days ago

Thanks for the suggestion! Have you found that Ray Serve’s built-in autoscaling plays nicely with custom SLO-based concurrency limits, or do you usually let Ray handle the load balancing entirely?
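
For context, here is a rough sketch of what an "SLO-based concurrency limit" could look like. It is only an illustration in plain Python, not Ray Serve's API and not the setup discussed in this thread: the class name `SloConcurrencyLimiter`, its fields, and the additive-increase / multiplicative-decrease policy are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloConcurrencyLimiter:
    """Hypothetical AIMD-style concurrency limit driven by a latency SLO."""

    slo_seconds: float   # latency budget per request (assumed metric)
    limit: int = 8       # current maximum number of in-flight requests
    min_limit: int = 1
    max_limit: int = 64
    in_flight: int = 0

    def try_acquire(self) -> bool:
        """Admit a request only while we are under the current limit."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self, latency_seconds: float) -> None:
        """Record a finished request and adapt the limit to the SLO."""
        self.in_flight = max(0, self.in_flight - 1)
        if latency_seconds > self.slo_seconds:
            # SLO missed: back off multiplicatively.
            self.limit = max(self.min_limit, self.limit // 2)
        else:
            # SLO met: probe for more capacity additively.
            self.limit = min(self.max_limit, self.limit + 1)


# Usage: one fast request raises the limit, one slow request halves it.
limiter = SloConcurrencyLimiter(slo_seconds=0.5, limit=4)
if limiter.try_acquire():
    limiter.release(latency_seconds=0.2)   # limit is now 5
if limiter.try_acquire():
    limiter.release(latency_seconds=1.0)   # limit is now 2
```

A limiter like this sits in front of the model server and rejects or queues requests once the limit is reached. Whether it cooperates with an autoscaler depends on which signal the autoscaler watches, which is the crux of the question above.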


korbonits 34 days ago

To be honest, I don't know, because I haven't hit many of those limits at what I would call "moderate" scale. So far, I have just provisioned enough pods to handle the traffic as-is, without using KubeRay. So k8s is handling the load balancing adequately at the moment, but Ray Serve is only pod-aware, not cluster-aware, for now.
