| HN Mirror


by lemonademan 18 hours ago

I personally believe that once you get beyond a handful of GPUs, people probably end up using both levels of telemetry, because they answer different questions. NVML is good for per-request attribution and understanding model behavior, but I believe PDU/BMC measurements are better suited to measuring actual power draw, since they capture everything (CPUs, networking, PSU losses, fans, etc.).

For instance, people running 32+ GPU setups probably sample rack/PDU power about once per second and correlate those samples with request timestamps, rather than trying to preserve strict per-request attribution at the rack level.
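To make the timestamp-correlation idea concrete, here is a rough Python sketch. It assumes PDU readings arrive as `(unix_timestamp, watts)` pairs sorted by time at roughly 1 Hz; the function names and data layout are made up for illustration and are not from any particular PDU or BMC API.

```python
from bisect import bisect_left, bisect_right


def mean_rack_power(pdu_samples, start, end):
    """Mean watts of the PDU samples whose timestamp falls in [start, end].

    pdu_samples: list of (unix_ts, watts) tuples, sorted by timestamp
                 (hypothetical layout, assumed ~1 sample per second).
    Returns None if no sample falls inside the window.
    """
    timestamps = [ts for ts, _ in pdu_samples]
    lo = bisect_left(timestamps, start)   # first sample with ts >= start
    hi = bisect_right(timestamps, end)    # one past the last sample with ts <= end
    window = [watts for _, watts in pdu_samples[lo:hi]]
    if not window:
        return None
    return sum(window) / len(window)


def rack_energy_joules(pdu_samples, start, end):
    """Rough energy estimate: mean power in the window times its duration."""
    mean_watts = mean_rack_power(pdu_samples, start, end)
    if mean_watts is None:
        return None
    return mean_watts * (end - start)
```

This only tells you what the whole rack drew while a request (or batch) was running. Splitting that number between concurrent requests would still need something like the NVML per-GPU data.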

Either way, I haven't seen many people publish how they instrument this in practice, so take what I wrote with a grain of salt. I simply wanted to share a little of what I understand, and I hope it helps.

1 comment

anax32 11 hours ago

Yes, thank you. That's exactly where I am, and I'm trying to gather some knowledge.

The power draw from the wall is especially important, because a spike across multiple devices at the same time can cause issues which are really difficult to debug.
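One simple way to look for that kind of coincident spike is to sum per-device readings that share a timestamp and flag the moments where the total exceeds a budget. A minimal sketch, assuming samples are already bucketed to the same timestamps and shaped as `(timestamp, device_id, watts)`; the function name and the budget value are hypothetical, and per-device numbers will undercount the real wall draw (PSU losses, fans, etc.).

```python
from collections import defaultdict


def coincident_spikes(samples, budget_watts):
    """Return sorted (timestamp, total_watts) pairs where the summed draw
    across all devices at that timestamp exceeds budget_watts.

    samples: iterable of (timestamp, device_id, watts) tuples
             (hypothetical layout; timestamps assumed pre-aligned).
    """
    totals = defaultdict(float)
    for ts, _device_id, watts in samples:
        totals[ts] += watts
    return sorted((ts, total) for ts, total in totals.items() if total > budget_watts)
```

Comparing the flagged timestamps against the PDU trace for the same period would show whether the wall measurement actually saw the spike.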
