|
|
|
|
|
by danielhanchen
929 days ago
|
|
So there's 2 main reasons: 1. DDP itself has an overhead since it has to synchronize gradients at each training step since GPU0 and GPU1 has to give gradients to GPU0. 2. Huggingface seems to not be optimized well for DDP mainly due to inefficient data movement - we fixed that - interestingly - even on 1 GPU it's faster. |
|
In other words, every benchmark, in either HF or Unsloth, is slower in absolute terms when going from 1 to 2 GPUs. That makes me think something is wrong with the test.
Could you share your benchmark code?