by neilmovva
929 days ago
Promising results, excited to try it out! A question on the perf benchmarks: why do all the results with 2 GPUs and DDP take longer than the single-GPU case?
Both benchmarks do the same amount of work (one training epoch), so this negative scaling is surprising.
1. DDP itself has overhead: it has to synchronize gradients at every training step, because the gradients computed on GPU0 and GPU1 have to be combined before the weights are updated.
2. Hugging Face does not seem to be well optimized for DDP, mainly due to inefficient data movement. We fixed that, and interestingly it's faster even on 1 GPU.
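
To make point 1 concrete, here is a rough toy model in plain Python (no PyTorch, no real GPUs). It is only a sketch: `all_reduce_mean`, `epoch_time`, and all the timings are made up for illustration and are not taken from the benchmarks. It shows the per-step gradient averaging, and how a per-step sync cost that is large relative to compute can make 2 GPUs slower than 1 for the same epoch.

```python
# Toy model of DDP's per-step gradient sync. Names and timings are hypothetical.

def all_reduce_mean(per_gpu_grads):
    """Average gradients element-wise; every worker ends up with the same copy."""
    n = len(per_gpu_grads)
    averaged = [sum(vals) / n for vals in zip(*per_gpu_grads)]
    return [list(averaged) for _ in per_gpu_grads]


def epoch_time(num_batches, compute_s, sync_s, num_gpus):
    """Assumed cost model: each GPU runs num_batches / num_gpus steps,
    and every step pays sync_s extra when more than one GPU is involved."""
    steps = -(-num_batches // num_gpus)  # ceiling division
    per_step = compute_s + (sync_s if num_gpus > 1 else 0.0)
    return steps * per_step


# GPU0 and GPU1 computed different gradients on their own mini-batches.
synced = all_reduce_mean([[1.0, 2.0], [3.0, 6.0]])

# Made-up timings: 1.0 s compute per step, 1.5 s sync per step.
one_gpu = epoch_time(100, compute_s=1.0, sync_s=1.5, num_gpus=1)
two_gpu = epoch_time(100, compute_s=1.0, sync_s=1.5, num_gpus=2)
print(synced, one_gpu, two_gpu)
```

In this model, 2 GPUs only win when the sync cost per step is smaller than the compute time per step, which is one way the negative scaling in the question can show up.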