Hacker News new | ask | show | jobs
by Mernit 1015 days ago
Sharing GPUs only really makes sense for GPUs that are large enough to share. MIGs can work for 80Gi A100s but won't work with smaller cards like T4s. It also adds latency to the GPU operations. Unfortunately there's not yet a silver bullet for this stuff.
1 comments

That’s why I was curious about utilization since you mentioned low memory usage. I believe time slicing can work on those smaller cards these days. Did you explore any other optimizations like batching or concurrency for same model?

Model heterogeneity seems like a real challenge there — you could optimize usage if you know all the sizes ahead of times and actually have gpu capacity to do efficient allocations, but it’s way harder than just doling 1 gpu per pod.

e: also, latency because of reduced resources? Or what do you mean?