Yeah, the high bandwidth requirements still remain. Over the past year, more research has moved from fully async setups to more constrained cases that still allow for geographically distributed compute. Async Local-SGD aims for a more standard training objective, comparable with lockstep training: https://arxiv.org/abs/2401.09135. IMO the technique is looking better.
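For anyone who wants a concrete picture of the bandwidth trade-off, here's a toy sketch of plain synchronous Local-SGD. To be clear, this is not the async variant from the paper, and every name and number in it (`local_sgd`, the shard sizes, the learning rate) is made up for illustration. Each worker takes several local gradient steps on its own data, and only the parameter difference gets shipped and averaged once per round.

```python
import numpy as np

def local_sgd(num_workers=4, local_steps=8, rounds=20, lr=0.1, dim=5, seed=0):
    """Toy synchronous Local-SGD on noise-free linear regression.

    Illustrative only: hypothetical setup, not the algorithm from the paper.
    """
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=dim)

    # Each worker holds its own data shard (stand-in for a remote site).
    shards = []
    for _ in range(num_workers):
        X = rng.normal(size=(64, dim))
        shards.append((X, X @ true_w))

    global_w = np.zeros(dim)
    comm_rounds = 0
    grad_steps = 0
    for _ in range(rounds):
        deltas = []
        for X, y in shards:
            w = global_w.copy()
            # Local steps: no communication happens inside this loop.
            for _ in range(local_steps):
                grad = 2.0 * X.T @ (X @ w - y) / len(y)
                w -= lr * grad
                grad_steps += 1
            # Only the parameter difference is sent back.
            deltas.append(w - global_w)
        # One synchronization per round: average the differences.
        global_w += np.mean(deltas, axis=0)
        comm_rounds += 1
    return global_w, true_w, comm_rounds, grad_steps

if __name__ == "__main__":
    w, true_w, comm_rounds, grad_steps = local_sgd()
    print("error:", np.linalg.norm(w - true_w))
    print("sync rounds:", comm_rounds, "gradient steps per worker:", grad_steps // 4)
```

With these defaults each worker syncs once per 8 gradient steps instead of every step. Each sync still moves a full parameter-sized vector though, which is why the bandwidth requirement shrinks but doesn't go away.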
The second article you linked indicates there will still be intense bandwidth requirements during training, shipping around gradient differentials.
What has changed in the past year? Is this technique looking better, worse, or the same?