|
|
|
|
|
by xtacy
653 days ago
|
|
It's also a bit odd that they do not implement congestion control. Congestion control is fundamental unless you only have point-to-point data transfers, which is rarely the case. All-reduce operation during training requires N to 1 data transfer. In these scenarios the sender needs to control its data transfer rates so as to not overwhelm not just the receiver, but also the network... if this is not done, it will cause congestion collapse (https://en.wikipedia.org/wiki/Network_congestion#:~:text=ser...). |
|
> We proceeded without DCQCN for our 400G deployments. At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives.
https://engineering.fb.com/2024/08/05/data-center-engineerin...