Hacker News
by Mernit 1015 days ago
What are you using for K8s autoscaling? We initially tried a few standard K8s scaling mechanisms and found that they didn't work well for GPU workloads. For example, if we were serving a low-RAM Hugging Face model on a GPU, its memory usage stayed too low to trigger autoscaling. But since the GPU can only process one request at a time, the system would get bottlenecked while requests waited to be processed one by one.

We wrote a bit about this here, if anyone is interested: https://www.beam.cloud/blog/serverless-autoscaling
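
A rough sketch of that idea, scaling on waiting requests instead of CPU or RAM. This is not Beam's actual implementation; the function name, parameters, and limits are all hypothetical:

```python
import math

def desired_replicas(queue_depth, in_flight, per_replica_concurrency=1,
                     min_replicas=0, max_replicas=10):
    """Return how many GPU replicas are needed for the current load.

    Assumes each replica handles `per_replica_concurrency` requests at a
    time (1 for a GPU that serves requests one by one).
    """
    pending = queue_depth + in_flight
    needed = math.ceil(pending / per_replica_concurrency)
    # Clamp to the configured bounds (hypothetical defaults).
    return max(min_replicas, min(max_replicas, needed))
```

With one request per GPU, five queued requests plus one in flight ask for six replicas, no matter how little memory the model uses.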

If you aren’t fully utilizing GPU memory, can you use time slicing? Or MIG for A100?

Orthogonal solution but could help.

Sharing GPUs only really makes sense for GPUs that are large enough to share. MIG can work for 80 GB A100s but won't work with smaller cards like T4s. It also adds latency to the GPU operations. Unfortunately there's not yet a silver bullet for this stuff.
That’s why I was curious about utilization since you mentioned low memory usage. I believe time slicing can work on those smaller cards these days. Did you explore any other optimizations like batching or concurrency for the same model?

Model heterogeneity seems like a real challenge there: you could optimize usage if you know all the sizes ahead of time and actually have the GPU capacity to do efficient allocations, but it’s way harder than just doling out 1 GPU per pod.
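
To make the "known sizes" case concrete, here is a toy first-fit-decreasing packing of models onto GPUs by memory alone. The function and the model sizes are made up for illustration, and it ignores compute contention entirely:

```python
def pack_models(model_mem_gb, gpu_mem_gb):
    """Assign models to GPUs with first-fit decreasing on memory only.

    model_mem_gb: dict of model name -> memory needed in GB (hypothetical).
    gpu_mem_gb:   memory per GPU, assumed identical across GPUs.
    Returns a list of GPUs, each a list of model names.
    """
    free = []       # remaining memory on each GPU opened so far
    placement = []  # model names placed on each GPU
    largest_first = sorted(model_mem_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, mem in largest_first:
        if mem > gpu_mem_gb:
            raise ValueError(f"{name} does not fit on a single GPU")
        for i, remaining in enumerate(free):
            if mem <= remaining:
                free[i] -= mem
                placement[i].append(name)
                break
        else:
            # No existing GPU has room, so open a new one.
            free.append(gpu_mem_gb - mem)
            placement.append([name])
    return placement
```

Even this simple version needs every size up front, which is the hard part when the mix of models keeps changing.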

e: also, latency because of reduced resources? Or what do you mean?