Hacker News
by scottlegrand2 2977 days ago
What they're not saying is that one can't use all of the NVLink bandwidth for gradient reduction on a DGX-1V with only 4 GPUs, because NVLink there is composed of two 8-node rings. And given the data-parallel nature of this benchmark, I'm very interested in where the time was spent on each architecture.

That said, they fixed this with NVSwitch, so it's just another HW hiccup like int8 was on Pascal.

1 comment

For this benchmark, NVLink and gradient reduction aren't the bottleneck. The performance scales almost perfectly linearly from one GPU to four.
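
For anyone who wants to check a scaling claim like this against their own runs, here is a minimal sketch of how "almost perfectly linear" can be quantified. The function name and the throughput figures are made up for illustration and are not measurements from this benchmark; substitute your own samples/sec numbers.

```python
def scaling_efficiency(throughput_1gpu, throughput_ngpu, n_gpus):
    """Return the fraction of ideal linear speedup achieved on n_gpus.

    1.0 means perfectly linear scaling; lower values mean some time is
    going to overhead such as communication or gradient reduction.
    """
    if throughput_1gpu <= 0 or n_gpus <= 0:
        raise ValueError("throughput_1gpu and n_gpus must be positive")
    return throughput_ngpu / (n_gpus * throughput_1gpu)


# Placeholder throughputs (samples/sec), NOT real benchmark results.
one_gpu = 100.0
four_gpu = 390.0
print(f"efficiency on 4 GPUs: {scaling_efficiency(one_gpu, four_gpu, 4):.3f}")
```

If the efficiency stays close to 1.0 as GPUs are added, the interconnect is probably not the limiting factor at that scale, though that by itself says nothing about how it behaves at 8 GPUs.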