|
|
|
|
|
by lemonademan
18 hours ago
|
|
I personally believe once you get beyond a handful of GPUs, people probably end up using both levels of telemetry because they answer different questions. NVML is nice for per-request attribution and understanding model behavior, but I believe PDU/BMC measurements are better suited for actual power draw since they capture everything (CPUs, networking, PSU losses, fans, etc.). For instance, people running 32+ GPU setups probably correlate timestamps rather than trying to preserve strict per-request attribution at the rack level. This will enable these individuals to have rack/PDU power sampled every second. Either way, I haven't seen many people publish how they instrument this in practice so take what I wrote with a gran of salt. I simple wanted to share a little bit of what I understand and I hope it helps. |
|
The power draw from the wall is especially important, because a spike across multiple devices at the same time can cause issues which are really difficult to debug.