by moyix
1975 days ago
For large models it does help! The training loop for multiple GPUs with data parallelism is roughly:

1. Split the data up
2. Do a forward and backward pass on each GPU individually
3. Compute the average of the gradients and update the model on each GPU
4. Repeat

For step 3 you need to send the gradients from each GPU somewhere, and then send back either the averaged gradient or the updated model weights. So when the model is large (say, 3GB for GPT 774M!) that's a lot of GPU-GPU communication!

You're right that for the vast majority of ML cases, the models are small enough that the synchronization cost is negligible, though.

I wrote up some benchmarks here: https://github.com/huggingface/transformers/issues/9371
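To make the four steps concrete, here's a rough toy sketch in plain Python. It's not how any real framework does it: the "GPUs" are just lists holding a one-weight model, and every name in it (`local_gradient`, `average_gradients`, `train`, `gradient_bytes`) is made up for illustration. The size estimate assumes 4 bytes per parameter (fp32), which is my assumption for where the ~3GB figure comes from.

```python
# Toy simulation of data-parallel training. Each "GPU" is a Python list
# holding a replica of a one-parameter model y = w * x. Hypothetical names.

def local_gradient(weights, shard):
    """Step 2: gradient of mean squared error on one worker's data shard."""
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]

def average_gradients(grads):
    """Step 3: element-wise mean of the per-worker gradients."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def train(data, num_workers=2, lr=0.01, steps=200):
    # Step 1: split the data up, one shard per worker.
    shards = [data[i::num_workers] for i in range(num_workers)]
    # Every worker starts from the same weights.
    replicas = [[0.0] for _ in range(num_workers)]
    for _ in range(steps):  # Step 4: repeat.
        # Step 2: each worker computes a gradient on its own shard.
        grads = [local_gradient(w, s) for w, s in zip(replicas, shards)]
        # Step 3: average the gradients, then apply the same update everywhere.
        avg = average_gradients(grads)
        replicas = [[wi - lr * gi for wi, gi in zip(w, avg)] for w in replicas]
    return replicas

def gradient_bytes(num_params, bytes_per_param=4):
    """Size of one full gradient, assuming fp32 (4 bytes per parameter)."""
    return num_params * bytes_per_param

data = [(x, 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
replicas = train(data)
print(replicas)                              # all replicas hold the same weight
print(gradient_bytes(774_000_000) / 1e9)     # GB per gradient exchange
```

The averaging in step 3 is the part that costs communication: every worker has to ship a gradient as large as the model itself on every iteration, so the traffic scales with parameter count rather than with batch size.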