Hacker News
by learndeeply, 1398 days ago
Both - the first is for sharding tensors across GPUs, the second is for doing an all-reduce (e.g. in distributed data parallel, to synchronize gradients across replicas).
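To make the distinction concrete, here is a rough pure-Python sketch that simulates both ideas on plain lists, with no real GPUs or communication library involved. The names `shard` and `all_reduce_mean` are made up for illustration and are not any framework's API. It also assumes the all-reduce averages the gradients, which is a common choice in data parallel training but not the only one.

```python
def shard(tensor, num_devices):
    """Split a flat list into num_devices contiguous, near-equal chunks.

    Hypothetical helper: each "device" holds only its own slice of the tensor.
    """
    base, extra = divmod(len(tensor), num_devices)
    shards, start = [], 0
    for rank in range(num_devices):
        # The first `extra` ranks take one additional element.
        size = base + (1 if rank < extra else 0)
        shards.append(tensor[start:start + size])
        start += size
    return shards


def all_reduce_mean(per_device_grads):
    """Simulated all-reduce: every device ends up with the same averaged gradient.

    Hypothetical helper: each device holds a full-size gradient computed on
    its own batch; the result is one identical copy per device.
    """
    num_devices = len(per_device_grads)
    # Element-wise sum across devices, then divide by the device count.
    summed = [sum(values) for values in zip(*per_device_grads)]
    mean = [s / num_devices for s in summed]
    return [list(mean) for _ in range(num_devices)]


# Sharding: one tensor is split, each device stores a different piece.
weights = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
weight_shards = shard(weights, 2)

# All-reduce: each device has its own full gradient, all end up with the mean.
local_grads = [[1.0, 2.0], [3.0, 6.0]]
synced_grads = all_reduce_mean(local_grads)
```

In short, sharding leaves each device with different data, while an all-reduce leaves every device with the same combined result. Real frameworks do the latter with collective communication between devices rather than a loop on one host.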