by ogrisel 3500 days ago
What is really fishy is evaluating training time speed ups in terms of throughput. The latency induced by the parallelism mechanism (when using asynchronous data parallelism) might seriously hamper the convergence speed. The presence of this potential problem cannot be detected in the throughput metric. They should have used a convergence metric instead (e.g. training time to reach 99% of the best validation loss).
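
To make the convergence metric concrete, here is a minimal sketch. It assumes each run logs (wall-clock time, validation loss) pairs, and it reads "99% of the best validation loss" as "within 1% of the best loss reached by the reference run". The function name and the curves are hypothetical and only illustrate the calculation.

```python
from typing import Optional, Sequence


def time_to_target(
    times_s: Sequence[float],
    val_losses: Sequence[float],
    target_loss: float,
) -> Optional[float]:
    """Return the first logged time at which val loss <= target_loss, else None."""
    if len(times_s) != len(val_losses):
        raise ValueError("times_s and val_losses must have the same length")
    for t, loss in zip(times_s, val_losses):
        if loss <= target_loss:
            return t
    return None  # the run never reached the target


# Made-up curves for illustration only (not measurements from any benchmark).
single_gpu = ([100.0, 200.0, 300.0, 400.0], [2.0, 1.0, 0.6, 0.5])
multi_gpu = ([2.0, 4.0, 6.0, 8.0, 10.0], [2.5, 1.5, 0.9, 0.6, 0.5])

# Use the same target for both runs: within 1% of the single-GPU best loss.
target = 1.01 * min(single_gpu[1])
t_single = time_to_target(*single_gpu, target)
t_multi = time_to_target(*multi_gpu, target)

if t_single is not None and t_multi is not None:
    print(f"time-to-target speedup: {t_single / t_multi:.1f}x")
```

A speedup computed this way penalizes a setup that processes images quickly but needs more epochs to converge, which a throughput number cannot show.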

If they can achieve 109x speed up with 128 GPUs using synchronous data parallelism with a batch size tuned for optimal single GPU convergence time, then this is very impressive (but quite unlikely).
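
For scale, the claimed figure corresponds to roughly 85% parallel efficiency. A quick sketch of the arithmetic (the helper name is hypothetical):

```python
def scaling_efficiency(speedup: float, num_workers: int) -> float:
    """Fraction of ideal linear scaling achieved by a measured speedup."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return speedup / num_workers


# 109x on 128 GPUs, the numbers quoted in this thread.
print(f"{scaling_efficiency(109, 128):.1%}")  # 85.2%
```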

However I don't think that publishing training benchmarks on Inception v3 (vs say AlexNet) is a fraud. Inception v3 is close to the state of the art and very good at using few parameters & inference FLOPS for a good test accuracy.

Inception v3 has been publicly available for quite a long time in a variety of DL toolkits along with pre-trained weights.


The results are reported using synchronized SGD with each GPU using batch size 32. More details, such as scripts to reproduce the results and scalability results on various networks (including AlexNet) and various batch sizes, will be available soon. I'll put more technical details, such as implementation details and performance analysis, in my PhD thesis.
Then the total batch size is growing with the number of GPUs and the convergence might be impacted both in terms of speed and solution quality (e.g. https://arxiv.org/abs/1609.04836 ).

I could believe you if you tell me that the validation loss and test accuracy of the large distributed model remain as good as those of the sequential, single-GPU model after the same total number of epochs. But this is not a given, and if it's not the case I would find those benchmarks deceptive.

There are a lot of papers talking about the trade-off between algorithm convergence (validation accuracy) and system efficiency. At least it is my major PhD research topic at CMU. In the context of synchronized SGD on deep CNN models, my observation is that up to batch size X, the convergence speed is not so sensitive to the batch size; between X and Y, we still get a good convergence rate by tuning the hyper-parameters carefully; but beyond Y, it becomes an interesting research question.

Both X and Y are related to the dataset and network complexity. A rough guess I often use is num_classes < X < 10 * num_classes and Y ~= 10 * X. To accelerate the convergence for batch sizes between X and Y, we can increase the data augmentation, the learning rate, or both. The basic idea is to add more noise to the SGD training to avoid falling into suboptimal points too easily.
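
As a rough sketch of that rule of thumb (a guess, not a theorem), the helpers below turn it into numeric bounds and check the two settings mentioned in this thread. The function names and regime labels are hypothetical. Since X is only known as a range, anything between the lower bound of X and the upper bound of Y is reported as "may need tuning".

```python
def heuristic_bounds(num_classes: int) -> dict:
    """Bounds from the rule of thumb: num_classes < X < 10 * num_classes, Y ~= 10 * X."""
    x_min, x_max = num_classes, 10 * num_classes
    return {"x_min": x_min, "x_max": x_max, "y_min": 10 * x_min, "y_max": 10 * x_max}


def batch_regime(global_batch: int, num_classes: int) -> str:
    """Classify a global batch size against the heuristic bounds."""
    b = heuristic_bounds(num_classes)
    if global_batch <= b["x_min"]:
        return "below X: convergence not very sensitive to batch size"
    if global_batch > b["y_max"]:
        return "beyond Y: open research question"
    return "around X..Y: may need hyper-parameter tuning"


# Synchronous SGD: global batch = per-GPU batch * number of GPUs.
imagenet_batch = 32 * 128  # 4096, the ImageNet 1K setting discussed here
print(imagenet_batch, batch_regime(imagenet_batch, num_classes=1000))

# CIFAR-10 with a batch of about 12K, as in the linked paper.
print(12_000, batch_regime(12_000, num_classes=10))
```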

The paper you mentioned studies the extreme case where batch size >> Y. They used CIFAR 10 (num_classes = 10) and batch size (20% num_examples = 12K). I am also surprised that they extended our earlier work to CNNs and showed promising results (Sec 4.2).

But, as the paper's authors also mention, there is little theory to explain this yet. I expect that the research community will have fun with it for a while.

But back to the MXNet benchmark: we did successfully tune the hyper-parameters with 128 GPUs and batch size = 32 * 128 to match the convergence of a single machine on the ImageNet 1K dataset. So we think our setting is reasonable. But the main point here is that we wanted to show how fast the system can go, so that researchers can more easily try more efficient distributed algorithms here.

Are you claiming MXNet gets a ~109x speedup in training time to X% accuracy with synchronized SGD on 128 GPUs?