by dmitrim 1440 days ago
For now it only tries to extract the NCCL time percentage from the profile, if available, and shows it in the profile summary. Some hints could be in the step trace timeline as well. We are planning to record some NCCL-related counters separately too.

The problem with NCCL is that it reports combined bandwidth: NVLink (intranode) plus network. I want to see the network traffic on its own, for example to identify a network link bottleneck when changing the model or pipeline parallelism configuration.

p.s. if you solve this I’ll become a paying customer.

Understood, we'll definitely think about the network part. Just in case it helps: if `nvidia-smi nvlink -gt d` is useful for you in this context, there is a related metric, NVLink Throughput Rate, that you can use to compare runs and monitor. At the very least you might get an idea of whether and how the internal links are utilized.
Yes, I thought about it. In theory I can measure the total traffic with mpirun, then subtract the NVLink traffic (as measured by nvidia-smi) from it. However, I'm not 100% sure that the NVLink traffic reported by nvidia-smi is the same as the NVLink component of the mpirun total. I'd prefer to measure internode traffic directly (e.g. using Mellanox tools) as a more reliable method.
Yes, exactly this.
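
For anyone who wants to try the direct measurement, here is a rough sketch rather than a tested tool. It assumes a Linux host whose InfiniBand/Mellanox ports expose the standard sysfs counters `port_xmit_data` and `port_rcv_data`, which count 4-byte words. The helper names (`read_words`, `rate_bytes_per_sec`, `sample_port`) are made up for illustration, and counter wrap or saturation is not handled.

```python
import time
from pathlib import Path

# Assumption: InfiniBand port counters live under this sysfs directory.
IB_SYSFS = Path("/sys/class/infiniband")

# port_xmit_data / port_rcv_data count 4-byte words, not bytes.
WORD_BYTES = 4


def read_words(counters_dir, name):
    """Read one counter (e.g. 'port_xmit_data') as an integer word count."""
    return int((Path(counters_dir) / name).read_text().strip())


def rate_bytes_per_sec(words_before, words_after, interval_s):
    """Convert a counter delta over interval_s seconds to bytes per second."""
    return (words_after - words_before) * WORD_BYTES / interval_s


def sample_port(counters_dir, interval_s=1.0):
    """Return (tx, rx) bytes/s for one port, sampled over interval_s."""
    tx0 = read_words(counters_dir, "port_xmit_data")
    rx0 = read_words(counters_dir, "port_rcv_data")
    time.sleep(interval_s)
    tx1 = read_words(counters_dir, "port_xmit_data")
    rx1 = read_words(counters_dir, "port_rcv_data")
    return (rate_bytes_per_sec(tx0, tx1, interval_s),
            rate_bytes_per_sec(rx0, rx1, interval_s))


if __name__ == "__main__":
    # Prints nothing on hosts without InfiniBand devices.
    for counters in sorted(IB_SYSFS.glob("*/ports/*/counters")):
        tx, rx = sample_port(counters)
        print(f"{counters.parent}: tx={tx / 1e9:.3f} GB/s rx={rx / 1e9:.3f} GB/s")
```

Running this on each node during a training step and comparing across parallelism configurations should show the internode share without having to subtract NVLink numbers from a combined total.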