|
|
|
|
|
by sergiotapia
659 days ago
|
|
running GPU models and maximizing utilization is pretty opaque to me as a layman coming into the scene. take this example: https://gist.github.com/sergiotapia/efc9b3f7163ba803a260b481... - running a fairly simple model that takes only 70ms per image pair, but because I have 300 images it becomes a big time sink. by using ThreadPoolExecutor, I cut that down to about 16 seconds. i wonder if there is a fairly obvious way to truly utlize my beefy L40S GPU! is it MPS? I haven't been successful at even running the MPS daemon on my linux server yet. very opaque for sure! |
|
So using 10-wide parallel processing took your batch from 21 seconds down to 16 seconds, did I do the arithmetic correctly? That suggests the single-threaded version isn’t too bad. I mean a 25% improvement is great and nothing to sneeze at, but batching might only be trimming the gaps in between image pairs, or queueing up your memory copies while the previous inference is running. You can verify this with nsys profiles.
> i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS?
No idea, it’s not always easy (and generally speaking gets harder and harder as you approach 100%), but first profile to see what your utilization is before going down any big technical route. Maybe with your ThreadPoolExecutor, you’re already getting max utilization and using MPS can’t possibly help.