
Start with Nsight Systems and turn on GPU metrics. It’s super easy, and the plots will give you an immediate sense of your utilization and of any low-hanging optimization opportunities.

So using 10-wide parallel processing took your batch from 21 seconds down to 16 seconds, did I do the arithmetic correctly? That suggests the single-threaded version isn’t too bad. I mean, a roughly 24% reduction in wall time is great and nothing to sneeze at, but batching might only be trimming the gaps in between image pairs, or queueing up your memory copies while the previous inference is running. You can verify this with nsys profiles.
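To put numbers on that, here is a rough back-of-the-envelope sketch in Python. It takes the 21 s and 16 s figures at face value and, as a simplifying assumption of mine, inverts Amdahl's law to estimate what fraction of the run behaves as if it overlaps across the 10 workers. The function names are made up for illustration, and a real profile will tell you far more than this model can.

```python
def speedup(t_before: float, t_after: float) -> float:
    """Plain speedup ratio: how many times faster the new run is."""
    return t_before / t_after


def time_saved_fraction(t_before: float, t_after: float) -> float:
    """Fraction of the original wall time that was eliminated."""
    return (t_before - t_after) / t_before


def amdahl_parallel_fraction(s: float, n_workers: int) -> float:
    """Invert Amdahl's law, S = 1 / ((1 - p) + p / n), to solve for p.

    p is the fraction of the original runtime that would have to scale
    perfectly across n workers to produce the observed speedup S.
    """
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n_workers)


# Figures from the thread; everything derived from them is an estimate.
t_serial, t_parallel, workers = 21.0, 16.0, 10

s = speedup(t_serial, t_parallel)
saved = time_saved_fraction(t_serial, t_parallel)
p = amdahl_parallel_fraction(s, workers)

print(f"speedup:                {s:.2f}x")     # 1.31x
print(f"wall time saved:        {saved:.1%}")  # 23.8%
print(f"implied overlap share:  {p:.1%}")      # 26.5%
```

Under this (very simplified) model, only about a quarter of the original runtime is benefiting from the extra workers, which fits the idea that the threads are mostly hiding gaps and copies rather than running inferences side by side.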

> i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS?

No idea. Getting to full utilization is not always easy (and generally speaking it gets harder and harder as you approach 100%), so profile first to see what your utilization actually is before going down any big technical route. Maybe with your ThreadPoolExecutor you’re already getting max utilization, in which case MPS can’t possibly help.
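If you want a quick sanity check before opening a full nsys trace, something like the following Python sketch can poll NVML while your batch runs. It assumes the `nvidia-ml-py` package (imported as `pynvml`) is installed, and the helper names are hypothetical. Keep in mind that NVML's utilization figure is the percentage of the sample period in which at least one kernel was running, so it is much coarser than the GPU metrics in Nsight Systems and can read high even when most of the SMs are idle.

```python
import statistics
import time


def sample_gpu_utilization(duration_s=10.0, interval_s=0.1, device_index=0):
    """Poll NVML for coarse GPU utilization (percent) for duration_s seconds.

    Intended to run in a background thread while the real workload executes.
    """
    import pynvml  # assumption: the nvidia-ml-py package is available

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        samples = []
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            # .gpu is the percent of the last sample period with a kernel running
            samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
            time.sleep(interval_s)
        return samples
    finally:
        pynvml.nvmlShutdown()


def summarize_utilization(samples):
    """Reduce a list of utilization percentages to a few headline numbers."""
    return {
        "mean": statistics.fmean(samples),
        "max": max(samples),
        # Share of samples where NVML saw no kernel activity at all.
        "idle_fraction": sum(1 for s in samples if s == 0) / len(samples),
    }
```

One way to use it is to start `sample_gpu_utilization` in a `threading.Thread` right before kicking off the ThreadPoolExecutor run and pass the result to `summarize_utilization` afterwards. If the mean is already close to 100 and the idle fraction is near zero, the remaining question is how well each kernel fills the GPU, and that is where the nsys GPU metrics come in.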