| Utilisation is counted by the OS, it's not exposed as a performance counter by the hardware. Thus, it's limited by the level of abstraction presented by the hardware. It's useless on CPUs as well, just to a much much lesser extent to the point of it actually being useful. Basically, the OS sees the CPU as being composed of multiple cores, that's the level of abstraction. Thus, the OS calculates "portion of last second where atleast one instruction was sent to this core" on each core and then reports it. The single number version is an average of each core's value. On the other hand, the OS cannot calculate stuff inside each core - the CPU hides as part of its abstraction. That is, you cannot know "I$ utilisation", "FPU utilisation", etc,. In the GPU, the OS doesn't even see each SM (streaming multiprocessor, loosely analogous to a cpu core). It just sees the whole GPU as one black box abstraction. Thus, it calculates utilisation as "portion of last second where atleast one kernel was executing on the whole GPU". It cannot calculate intra-GPU util at all. So one kernel executing on one SM looks the same to the OS, as that kernel executing on tens of SMs! This is the crux of the issue. With performance counters (perf for CPU, or nsight compute for GPU), lots of stuff visible only inside the hardware abstraction can be calculated (SM util, warp occupancy, tensor util, etc) The question then, is why doesn't the GPU schedule stuff on each SM in the OS/driver? Instead of doing it in a microcontroller in the hardware itself on the other side of the interface? Well, I think it's due to efficiency reasons and also for nvidia to have more freedom to change it without having compat issues due to being tied to the OS, and similar reasons. If that were the case however, then the OS could calculate util for each SM, and then average it, giving you more accurate values - the case with the kernel running on 1 SM will report a smaller util than the case with the kernel executing on 15 SMs. IME, measuring on nsight compute causes anywhere from a 5% to 30% performance overhead, so if that's ok for you, you can enable it and get more useful measurements. |