by areichenbach 658 days ago
I’ve recently been trusting GPU power draw (watts) over utilization. Any idea how good that is as a simple proxy, if I’m only looking at nvidia-smi?
2 comments

Power usage is indeed a better indicator of how busy the GPU really is during ML training. It folds in many important indirect signals that aren’t otherwise visible, and it avoids many pitfalls of the reported compute utilization, which can read 100% even in an all-reduce deadlock, among other scenarios.
Power is a good proxy here too. For example, we've had distributed runs, monitored on WandB, where one worker died partway through and the rest were basically stalled waiting on it. We were only logging GPU stats for one worker, and it showed 100% util but almost no power draw above idle, which is how I found out something was stalling. Restarting fixed it and brought power draw back to normal, but even with high power draw, some sections of the code still had low SM efficiency (~20%) in that training run.
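If you want to automate the "100% util but idle-level power" check, here is a rough Python sketch. It is not what we actually ran: the function names and both thresholds (90% utilization, 30% of the power limit) are assumptions for illustration, and a measured idle baseline for your own cards would be a better reference than a fraction of the power limit. It uses only the `nvidia-smi --query-gpu` fields `index`, `utilization.gpu`, `power.draw` and `power.limit`.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY = "index,utilization.gpu,power.draw,power.limit"


def read_nvidia_smi():
    """Return raw CSV text from nvidia-smi (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_gpu_stats(text):
    """Parse 'index, util, power, limit' rows into dicts, skipping unreadable rows."""
    stats = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            stats.append({
                "index": int(fields[0]),
                "util": float(fields[1]),    # percent
                "power": float(fields[2]),   # watts
                "limit": float(fields[3]),   # watts
            })
        except ValueError:
            # Some GPUs report power as "[N/A]"; there is nothing to compare then.
            continue
    return stats


def looks_stalled(stat, util_threshold=90.0, power_fraction=0.3):
    """Heuristic (assumed thresholds): high reported util but low power draw."""
    return (stat["util"] >= util_threshold
            and stat["power"] < power_fraction * stat["limit"])


if __name__ == "__main__":
    for stat in parse_gpu_stats(read_nvidia_smi()):
        if looks_stalled(stat):
            print(f"GPU {stat['index']}: {stat['util']:.0f}% util at only "
                  f"{stat['power']:.0f} W, possible stall")
```

Run it on every worker, not just one. And keep in mind that this only catches stalls: as in the run above, normal power draw does not rule out low SM efficiency.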