| "Utilization" as reported by nvidia-smi tells you the percentage of the sampling window during which at least one kernel was running on the GPU. It does not take into account at all how much of the hardware's capacity that kernel is actually using. So if e.g. your kernel is stuck waiting on data from another GPU (NCCL) and doing no real work, it will still show 100% utilization. A good way to see this is when an NCCL call times out after 30 minutes for some reason, and you can see that all your GPUs (except the one that caused the failure) were at 100% util the whole time, even though they clearly did nothing but wait. Another example is operations with low compute intensity: say you want to add 1 to every element of a very large tensor. You effectively have to move every element (let's say FP8, so 1 byte) from HBM into L2, which is a very slow operation, just to do a single add, which is extremely fast. It takes roughly 1000x longer to move that byte than to do the add, so in effect your "true" utilization is ~0.1%, but nvidia-smi (and this tool) will show 100% for the entire duration of that add. Sadly there isn't a great general way to monitor "true" utilization during training. Generally you have to come up with an estimate of how many FLOPs your model requires per pass, measure the time that pass takes, and compare the FLOPs/sec you get to NVIDIA's spec sheet. If you get around 60% of theoretical FLOPs for a typical transformer LLM training run, you are basically at max utilization. |
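
For what it's worth, the FLOPs-based estimate described above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the function names are made up for illustration, the `6 * n_params` FLOPs-per-token figure is the usual rule of thumb for dense transformer training (it ignores attention-score FLOPs), and the run numbers in the example are hypothetical. The 312e12 peak is the dense BF16 figure for an A100; check the spec sheet for your own GPU and dtype.

```python
def training_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPs per token for a dense transformer.

    Assumption: the common ~6 * N rule of thumb (about 2N for the forward
    pass and 4N for the backward pass), ignoring attention-score FLOPs.
    """
    return 6.0 * n_params


def model_flops_utilization(
    n_params: float,
    tokens_per_step: float,
    step_time_s: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Return achieved FLOPs/sec divided by the theoretical peak (0.0 to 1.0)."""
    achieved_flops_per_s = (
        training_flops_per_token(n_params) * tokens_per_step / step_time_s
    )
    peak_flops_per_s = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s


if __name__ == "__main__":
    # Hypothetical run: 7B-parameter model, 4M tokens per optimizer step,
    # 15 s per step, 64 GPUs. None of these numbers come from a real job.
    mfu = model_flops_utilization(
        n_params=7e9,
        tokens_per_step=4_000_000,
        step_time_s=15.0,
        n_gpus=64,
        peak_flops_per_gpu=312e12,  # A100 dense BF16 peak; verify for your hardware
    )
    print(f"estimated utilization: {mfu:.1%}")
```

If the number this prints is far below the ~60% mentioned above while nvidia-smi still shows 100%, that gap is the idle or memory-bound time the utilization counter cannot see.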