by sergiotapia 659 days ago
running GPU models and maximizing utilization is pretty opaque to me as a layman coming into the scene.

take this example: https://gist.github.com/sergiotapia/efc9b3f7163ba803a260b481... - running a fairly simple model that takes only 70ms per image pair, but because I have 300 images it becomes a big time sink.

by using ThreadPoolExecutor, I cut that down to about 16 seconds. i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS? I haven't been successful at even running the MPS daemon on my Linux server yet. very opaque for sure!
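
For readers who can't open the truncated gist link, here is a rough sketch of the ThreadPoolExecutor pattern being described. It is not the code from the gist: `process_pair` is a hypothetical stand-in for one inference call, and `max_workers=10` is an assumption. Threads only help here if the per-pair work releases the GIL while it waits (as I/O and many GPU library calls typically do).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_pair(pair):
    """Hypothetical stand-in for one model call on an image pair."""
    time.sleep(0.001)  # pretend to wait on the GPU; sleeping releases the GIL
    a, b = pair
    return a + b


# 300 dummy "image pairs"; real code would hold file paths or tensors.
pairs = [(i, i + 1) for i in range(300)]

# max_workers=10 is an assumed value, not taken from the gist.
with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map returns results in the same order as the inputs.
    results = list(pool.map(process_pair, pairs))

print(len(results), "pairs processed")
```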

3 comments

Start with Nsight Systems and turn on GPU metrics. It’s super easy and the plots will give you an immediate sense of your utilization, and low-hanging optimization opportunities.

So using 10-wide parallel processing took your batch from 21 seconds (300 × 70 ms) down to 16 seconds; did I do the arithmetic correctly? That suggests the single-threaded version isn’t too bad. I mean, a roughly 25% improvement is great and nothing to sneeze at, but the parallelism might only be trimming the gaps in between image pairs, or queueing up your memory copies while the previous inference is running. You can verify this with nsys profiles.
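
To spell out that arithmetic, here is a quick sketch using only the figures quoted in this thread (300 pairs, 70 ms each, about 16 seconds with threads). It assumes the 70 ms figure covers the whole per-pair cost in the single-threaded run, which may not be exactly true.

```python
# Figures quoted in the thread.
n_pairs = 300
per_pair_s = 0.070        # 70 ms per image pair
threaded_total_s = 16.0   # reported wall time with ThreadPoolExecutor

# Estimated single-threaded wall time, assuming 70 ms is the full per-pair cost.
serial_total_s = n_pairs * per_pair_s

# Fraction of wall time saved, and the equivalent speedup factor.
saving = 1 - threaded_total_s / serial_total_s
speedup = serial_total_s / threaded_total_s

# Effective time per pair once the threads overlap some of the work.
effective_per_pair_ms = threaded_total_s / n_pairs * 1000

print(f"serial estimate:    {serial_total_s:.1f} s")
print(f"time saved:         {saving:.1%}")
print(f"speedup:            {speedup:.2f}x")
print(f"effective per pair: {effective_per_pair_ms:.1f} ms")
```

A speedup of about 1.3x from ten workers is far from 10x, which fits the reading that the GPU work itself is mostly serialized and the threads are only overlapping the overhead around it.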

> i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS?

No idea. It’s not always easy (and generally speaking gets harder and harder as you approach 100%), but profile first to see what your utilization is before going down any big technical route. Maybe with your ThreadPoolExecutor you’re already getting max utilization, in which case MPS can’t possibly help.

Batch as many requests together as possible and your utilization will increase.
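
A minimal sketch of what batching could look like, assuming the model can accept a list of pairs in a single call. `fake_model`, `run_batched`, `chunked`, and `batch_size=32` are all hypothetical names and values for illustration; in PyTorch the batch would more likely be one stacked tensor passed through a single forward call.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunked(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batched(pairs: Sequence[Tuple[int, int]],
                model_fn: Callable[[Sequence[Tuple[int, int]]], List[int]],
                batch_size: int = 32) -> List[int]:
    """Call model_fn once per batch instead of once per pair."""
    results: List[int] = []
    for batch in chunked(pairs, batch_size):
        results.extend(model_fn(batch))
    return results


# Hypothetical stand-in for a batched forward pass; records each batch size.
batch_sizes_seen: List[int] = []


def fake_model(batch: Sequence[Tuple[int, int]]) -> List[int]:
    batch_sizes_seen.append(len(batch))
    return [a + b for a, b in batch]


pairs = [(i, i + 1) for i in range(300)]
outputs = run_batched(pairs, fake_model, batch_size=32)

print(len(outputs), "results from", len(batch_sizes_seen), "model calls")
```

The point is the call count: 300 pairs become a handful of larger calls, so per-call overhead is paid far less often. The right batch size depends on the model and on GPU memory, so it is worth measuring rather than guessing.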
totally agreed. One of our main findings during this process is that there's still a lot of alpha in finding the right kernels for the job/model. We're hoping that `torch.compile` will become more mature in the future, because the current docs on performance, at least on the PyTorch side, definitely leave us wanting more.