by p1esk 1440 days ago
The problem with NCCL is that it reports combined bandwidth: NVLink (intranode) plus network. I want to see the network traffic on its own, for example to identify a network link bottleneck when changing the model or pipeline parallelism configuration.

p.s. if you solve this I’ll become a paying customer.

2 comments

Understood, we'll definitely think about the network part. In case it helps: if `nvidia-smi nvlink -gt d` is useful for you in this context, there is a related metric, NVLink Throughput Rate, that you can monitor and compare across runs. At the very least it should give you an idea of whether and how the internal links are utilized.
Yes, I thought about it - in theory I can measure the total traffic with mpirun, then subtract the NVLink traffic (as measured by nvidia-smi) from it. However, I'm not 100% sure that the NVLink traffic reported by nvidia-smi is the same as the NVLink component of the mpirun total. I'd prefer to measure internode traffic directly (e.g. using Mellanox tools) as a more reliable method.
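For what it's worth, here is a minimal sketch of the "measure internode traffic directly" idea, assuming a Linux node where the HCA exposes the standard InfiniBand sysfs counters. It samples `port_xmit_data` and `port_rcv_data` twice and turns the delta into a rate. The device name `mlx5_0`, the port number, and the helper names are assumptions for illustration, not anything from this thread. The counters are in 4-byte words per the kernel's sysfs documentation, and on some setups they can be 32-bit and saturate, so treat the numbers as a rough guide.

```python
import time
from pathlib import Path

# port_xmit_data / port_rcv_data count 4-byte words, not bytes.
WORD_BYTES = 4


def read_port_bytes(device, port=1, root="/sys/class/infiniband"):
    """Return cumulative (tx_bytes, rx_bytes) for one HCA port."""
    counters = Path(root) / device / "ports" / str(port) / "counters"
    tx = int((counters / "port_xmit_data").read_text()) * WORD_BYTES
    rx = int((counters / "port_rcv_data").read_text()) * WORD_BYTES
    return tx, rx


def rate_gbps(before, after, seconds):
    """Convert two (tx, rx) byte samples into (tx, rx) rates in Gbit/s."""
    return tuple((a - b) * 8 / seconds / 1e9 for b, a in zip(before, after))


def sample(device, port=1, interval=1.0, root="/sys/class/infiniband"):
    """Sample the port twice, `interval` seconds apart, and return Gbit/s."""
    before = read_port_bytes(device, port, root)
    time.sleep(interval)
    after = read_port_bytes(device, port, root)
    return rate_gbps(before, after, interval)


if __name__ == "__main__":
    # "mlx5_0" is a hypothetical device name; list /sys/class/infiniband
    # on the node to find the real one.
    tx, rx = sample("mlx5_0")
    print(f"tx {tx:.2f} Gbit/s, rx {rx:.2f} Gbit/s")
```

Running this on each node while the training job is active gives per-link numbers that don't depend on the NVLink subtraction at all, which is the point of measuring at the NIC.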
Yes, exactly this.