|
|
|
|
|
by ogrisel
3500 days ago
|
|
Then the total batch size is growing with the number of GPUs and the convergence might be impacted both in terms of speed and solution quality (e.g. https://arxiv.org/abs/1609.04836 ). I could believe you if tell you me that the validation loss and test accuracy of the large distributed model remains as good as the sequential, single GPU model after the same total number of epochs but this is not a given and if it's not the case I would find those benchmarks deceptive. |
|
Both X and Y are related to the dataset and network complexity. A rough guess I often use is num_classes < X < 10num_classes and Y ~= 10X. To accelerate the convergence for batch size between X and Y, we can either increase the data augmentation or learning rate, or both. The basic idea is to add more noise to the SGD training to avoid falling into suboptimal points too easily.
The paper you mentioned studies the extremely case that batch size >> Y. They used CIFAR 10 (num_classes = 10) and batch size (20% num_examples = 12K). I also surprised that they also extended our earlier work to CNN and showed promising results (Sec 4.2)
But also as mentioned by the paper authors, there is little theory we can say about that. I expected that the research community will have fun about it for a while.
But back to the MXNet benchmark, we did successfully tuned the hyper-parameters with 128 GPUs and batch size = 32 * 128 to match the convergence compared to a single machine on the Imagenet 1K dataset. So we think our setting is reasonable. But the main point here is that we are more willing to show how fast the system can achieve, so that researchers can easier try more efficient distributed algorithms here.