by alexeldeib 1025 days ago
If you aren’t fully utilizing GPU memory, can you use time slicing? Or MIG for A100s?

Orthogonal solution but could help.


Sharing GPUs only really makes sense for GPUs that are large enough to share. MIG can work for 80 GB A100s but won't work with smaller cards like T4s. It also adds latency to the GPU operations. Unfortunately there's not yet a silver bullet for this stuff.
That’s why I was curious about utilization, since you mentioned low memory usage. I believe time slicing can work on those smaller cards these days. Did you explore any other optimizations, like batching or concurrency for the same model?

Model heterogeneity seems like a real challenge there. You could optimize usage if you know all the sizes ahead of time and actually have the GPU capacity to do efficient allocations, but it’s way harder than just doling out 1 GPU per pod.
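
To make the allocation point concrete, here is a minimal sketch of packing models onto GPUs by memory footprint, assuming every size is known up front. It uses first-fit decreasing, a generic bin-packing heuristic that I'm swapping in for illustration, not something anyone in this thread said they run. The function name `pack_models`, the model names, and the sizes are all hypothetical, and it ignores compute contention and latency entirely.

```python
from typing import Dict, List


def pack_models(model_mem_gib: Dict[str, float], gpu_mem_gib: float) -> List[Dict[str, float]]:
    """First-fit decreasing: put each model on the first GPU with enough free memory."""
    gpus: List[Dict[str, float]] = []  # one dict per GPU: model name -> memory used
    # Place the largest models first so small ones can fill the leftover gaps.
    for name, mem in sorted(model_mem_gib.items(), key=lambda kv: kv[1], reverse=True):
        if mem > gpu_mem_gib:
            raise ValueError(f"{name} needs {mem} GiB, more than a single GPU holds")
        for gpu in gpus:
            if sum(gpu.values()) + mem <= gpu_mem_gib:
                gpu[name] = mem
                break
        else:
            # No existing GPU had room, so allocate a new one.
            gpus.append({name: mem})
    return gpus


# Hypothetical model sizes in GiB, for illustration only.
models = {"a": 10.0, "b": 6.0, "c": 5.0, "d": 3.0, "e": 2.0}
placement = pack_models(models, gpu_mem_gib=16.0)
for i, gpu in enumerate(placement):
    print(f"gpu {i}: {gpu}")
```

With these made-up numbers the five models fit on two 16 GiB GPUs instead of five under the 1-GPU-per-pod approach. The hard part in practice is that sizes and load are not known ahead of time, which is exactly where this stops being easy.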

e: also, latency because of reduced resources? Or what do you mean?