| HN Mirror

There are a lot of papers talking about the trade-off between algorithm convergence (validation accuracy) and system efficiency. At least it is my major phd research topic at CMU. In the context of synchronized SGD on deep CNN models, my observation is that up to batch-size X, the convergence speed is not so sensitive to the batch size; between X and Y, we still get good convergence rate by tuning the hyper-paramters carefully; but beyond Y, it then becomes an interesting research question.

Both X and Y are related to the dataset and network complexity. A rough guess I often use is num_classes < X < 10num_classes and Y ~= 10X. To accelerate the convergence for batch size between X and Y, we can either increase the data augmentation or learning rate, or both. The basic idea is to add more noise to the SGD training to avoid falling into suboptimal points too easily.

The paper you mentioned studies the extremely case that batch size >> Y. They used CIFAR 10 (num_classes = 10) and batch size (20% num_examples = 12K). I also surprised that they also extended our earlier work to CNN and showed promising results (Sec 4.2)

But also as mentioned by the paper authors, there is little theory we can say about that. I expected that the research community will have fun about it for a while.

But back to the MXNet benchmark, we did successfully tuned the hyper-parameters with 128 GPUs and batch size = 32 * 128 to match the convergence compared to a single machine on the Imagenet 1K dataset. So we think our setting is reasonable. But the main point here is that we are more willing to show how fast the system can achieve, so that researchers can easier try more efficient distributed algorithms here.