| HN Mirror

power is also a good proxy. For example, we've had distributed runs that we monitored on WandB where one of our workers died in the middle and the rest were basically stalling on the dead worker. On WandB, we were only logging GPU stats on one worker and that one had 100% util but basically no excess power draw compared to having nothing running, which is how I found out something was stalling. Restarting fixed it and got the power draw up to normal, but even with high power draw, we were still having some sections of code with low SM efficiency (~20%) for that training.