by mli 3500 days ago
The results are reported using synchronized SGD with each GPU using batch size 32. More details, such as scripts to reproduce the results and scalability results on various networks (including AlexNet) and various batch sizes, will be available soon. I'll put more technical details, such as implementation details and performance analysis, in my PhD thesis.

Then the total batch size grows with the number of GPUs, and convergence might be affected in terms of both speed and solution quality (e.g. https://arxiv.org/abs/1609.04836).

I could believe you if you tell me that the validation loss and test accuracy of the large distributed model remain as good as those of the sequential, single-GPU model after the same total number of epochs. But this is not a given, and if it's not the case I would find those benchmarks deceptive.

There are a lot of papers on the trade-off between algorithm convergence (validation accuracy) and system efficiency. It is, at least, my main PhD research topic at CMU. In the context of synchronized SGD on deep CNN models, my observation is that up to batch size X, the convergence speed is not very sensitive to the batch size; between X and Y, we can still get a good convergence rate by tuning the hyper-parameters carefully; but beyond Y, it becomes an interesting research question.

Both X and Y are related to the dataset and network complexity. A rough guess I often use is num_classes < X < 10 * num_classes and Y ~= 10 * X. To accelerate convergence for batch sizes between X and Y, we can increase the data augmentation, the learning rate, or both. The basic idea is to add more noise to the SGD training to avoid falling into suboptimal points too easily.
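A minimal sketch of that rule of thumb in Python, to make the ranges concrete. The function names, the label strings, and the choice to treat X and Y as ranges rather than single values are illustrative assumptions, not part of MXNet or of any published method. The class counts are those of CIFAR-10 and ImageNet 1K, and the batch sizes are the ones quoted in this thread.

```python
def heuristic_ranges(num_classes):
    """Rough ranges implied by num_classes < X < 10 * num_classes and Y ~= 10 * X."""
    x_range = (num_classes, 10 * num_classes)
    y_range = (10 * x_range[0], 10 * x_range[1])
    return x_range, y_range


def regime(batch_size, num_classes):
    """Coarse label for a total batch size; the boundaries are fuzzy by design."""
    (x_low, _), (_, y_high) = heuristic_ranges(num_classes)
    if batch_size <= x_low:
        return "below X: convergence likely insensitive to batch size"
    if batch_size > y_high:
        return "beyond Y: open research question"
    return "around X to Y: may need careful hyper-parameter tuning"


# (dataset name, num_classes, total batch size)
examples = [("CIFAR-10", 10, 12000), ("ImageNet 1K", 1000, 32 * 128)]
for name, classes, batch in examples:
    print(name, heuristic_ranges(classes), regime(batch, classes))
```

Since X is only a rough guess, the middle label deliberately covers everything from the lowest plausible X to the highest plausible Y.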

The paper you mentioned studies the extreme case where batch size >> Y. They used CIFAR 10 (num_classes = 10) and batch size (20% num_examples = 12K). I am also surprised that they extended our earlier work to CNNs and showed promising results (Sec 4.2).

But, as the paper's authors also mention, there is little theory to explain this so far. I expect that the research community will have fun with it for a while.

But back to the MXNet benchmark: we did successfully tune the hyper-parameters with 128 GPUs and batch size = 32 * 128 to match the convergence of a single machine on the ImageNet 1K dataset. So we think our setting is reasonable. But the main point here is that we mostly want to show how fast the system can run, so that researchers can more easily try more efficient distributed algorithms on it.
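To make the batch-size arithmetic in this setting concrete, here is a small Python sketch. The helper names are hypothetical, and the training-set size of 1,281,167 images is the commonly cited figure for ImageNet 1K, which is an assumption here and not a number from the benchmark itself. It shows only how the total batch size and the number of synchronized updates per epoch change with the number of GPUs; it says nothing about accuracy.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # assumed dataset size, not from the benchmark


def global_batch_size(per_gpu_batch, num_gpus):
    """Total batch size of one synchronized SGD step across all GPUs."""
    return per_gpu_batch * num_gpus


def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates per pass over the data (ceiling division)."""
    return (num_examples + batch_size - 1) // batch_size


for gpus in (1, 8, 128):
    batch = global_batch_size(32, gpus)
    steps = updates_per_epoch(IMAGENET_1K_TRAIN_IMAGES, batch)
    print(f"{gpus:>3} GPUs: batch size {batch:>4}, {steps:>5} updates per epoch")
```

With 128 GPUs the same epoch contains far fewer updates than with one GPU, which is why the hyper-parameters had to be retuned to match single-machine convergence.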

Are you claiming MXNet gets a ~109x speedup in training time to X% accuracy with synchronized SGD on 128 GPUs?