Hacker News
by learndeeply 1398 days ago
Both. The first is for sharding tensors across GPUs; the second performs an all-reduce (e.g. to synchronize gradients in distributed data parallel training).
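To make the distinction concrete, here is a rough toy sketch in plain Python. It is not the API of either tool being discussed: each "device" is simulated as a list entry, and `shard` and `all_reduce_mean` are hypothetical helper names made up for illustration. Averaging is assumed as the reduction, which is the usual choice for data-parallel gradients; a sum would work the same way.

```python
# Toy simulation in plain Python: no GPUs, no real library calls.
# `shard` and `all_reduce_mean` are hypothetical names for illustration only.

def shard(tensor, num_devices):
    """Split a flat list into num_devices equal contiguous chunks."""
    if len(tensor) % num_devices != 0:
        raise ValueError("tensor length must divide evenly across devices")
    chunk = len(tensor) // num_devices
    return [tensor[i * chunk:(i + 1) * chunk] for i in range(num_devices)]


def all_reduce_mean(per_device_grads):
    """Average gradients elementwise and hand every device the same result."""
    n = len(per_device_grads)
    mean = [sum(vals) / n for vals in zip(*per_device_grads)]
    return [list(mean) for _ in range(n)]


# Sharding: each device holds a different slice of one tensor.
shards = shard([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], num_devices=4)

# All-reduce: each replica computed its own gradient on its own batch;
# afterwards every replica holds the identical averaged gradient.
synced = all_reduce_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
```

The difference to notice: after sharding, the devices hold different pieces of one tensor, while after an all-reduce, every device holds the same full result.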