	by Mernit 1063 days ago
	What are you using for K8s autoscaling? We initially tried a few standard K8s scaling mechanisms and found that they didn't work well for GPU workloads. For example, if we were serving a low-RAM Hugging Face model on a GPU, its resource usage would never get high enough to trigger autoscaling. But since the GPU can only process one request at a time, the system would get bottlenecked while requests waited to be processed one by one. We wrote a bit about this here, if anyone is interested: https://www.beam.cloud/blog/serverless-autoscaling
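
To illustrate the bottleneck described above, here is a minimal sketch of scaling on pending requests rather than on memory usage. It is not taken from the linked post; the function name `desired_replicas`, its parameters, and the default limits are all hypothetical, and it assumes each replica handles a fixed number of requests at a time (one, in the case described).

```python
import math


def desired_replicas(
    queue_depth: int,
    in_flight: int,
    per_replica_concurrency: int = 1,  # assumption: one request per GPU at a time
    min_replicas: int = 0,             # assumption: scale-to-zero is allowed
    max_replicas: int = 10,            # assumption: arbitrary upper bound
) -> int:
    """Return a replica count sized to the work waiting, not to RAM usage."""
    if per_replica_concurrency < 1:
        raise ValueError("per_replica_concurrency must be at least 1")
    # Everything queued or currently being processed needs a slot somewhere.
    pending = queue_depth + in_flight
    wanted = math.ceil(pending / per_replica_concurrency)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Five queued requests plus one in flight, one request per GPU.
    print(desired_replicas(queue_depth=5, in_flight=1))
```

A memory-based trigger would see a mostly idle card in this situation, while a count of pending requests reflects the actual backlog.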

1 comment

alexeldeib 1063 days ago

If you aren't fully utilizing GPU memory, can you use time slicing? Or MIG on A100s?

Orthogonal solution but could help.


Mernit 1063 days ago

Sharing GPUs only really makes sense for GPUs that are large enough to share. MIG can work for 80GB A100s but won't work with smaller cards like T4s. Sharing also adds latency to the GPU operations. Unfortunately there's not yet a silver bullet for this stuff.


alexeldeib 1063 days ago

That's why I was curious about utilization, since you mentioned low memory usage. I believe time slicing can work on those smaller cards these days. Did you explore any other optimizations, like batching or concurrency for the same model?

Model heterogeneity seems like a real challenge there. You could optimize usage if you know all the sizes ahead of time and actually have the GPU capacity to do efficient allocations, but it's way harder than just doling out one GPU per pod.

Edit: also, do you mean latency because of reduced resources? Or something else?
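
The batching idea raised above can be sketched roughly as follows. This is only an illustration: `make_batches` and `max_batch_size` are hypothetical names, and it assumes the model can accept several inputs in one forward pass, which the thread does not confirm.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(requests: Sequence[T], max_batch_size: int) -> List[List[T]]:
    """Group pending requests so each GPU pass serves up to max_batch_size of them."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    # Slice the pending requests into consecutive chunks, preserving order.
    return [
        list(requests[start:start + max_batch_size])
        for start in range(0, len(requests), max_batch_size)
    ]


if __name__ == "__main__":
    pending = [f"req-{i}" for i in range(10)]
    batches = make_batches(pending, max_batch_size=4)
    # Ten requests become three passes instead of ten one-by-one passes.
    print(len(batches))
```

Whether this helps in practice depends on the model fitting a larger batch in GPU memory and on how much extra per-request latency is acceptable while a batch fills.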
