by scottlegrand2
2977 days ago
What they're not saying is that one can't use all of the NVLink bandwidth for gradient reduction on a DGX-1V with only 4 GPUs, because NVLink there is composed of two 8-node rings. And given the data-parallel nature of this benchmark, I'm very interested in where time was spent on each architecture. That said, they fixed this with NVSwitch, so it's just another HW hiccup like INT8 was on Pascal.