Hacker News
by oneshot908 3501 days ago
Using 3-year-old GPUs on a much deeper network than the other guys(tm) to demonstrate awesome scaling efficiency == Intel-level FUD. Note also the absence of the overall batch size.

Wonder what would happen to that scaling efficiency if those GPUs were P40s?

See also the absence of equivalent AlexNet numbers to further obscure attempts at comparing this to the other guys(tm).

Can't wait for Intel's response to this.


What is really fishy is evaluating training time speed ups in terms of throughput. The latency induced by the parallelism mechanism (when using asynchronous data parallelism) might seriously hamper the convergence speed. The presence of this potential problem cannot be detected in the throughput metric. They should have used a convergence metric instead (e.g. training time to reach 99% of the best validation loss).
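
To make the suggested metric concrete, here is a minimal sketch of a "time to target validation loss" comparison. The function name, the 1% tolerance reading of "99% of the best validation loss", and the toy curves are all assumptions for illustration, not anything taken from the benchmark.

```python
def time_to_loss(times, losses, target):
    """Return the first wall-clock time at which the validation loss
    is at or below `target`, or None if the run never gets there.

    times[i] is assumed to be the time at which losses[i] was measured.
    """
    if len(times) != len(losses):
        raise ValueError("times and losses must have the same length")
    for t, loss in zip(times, losses):
        if loss <= target:
            return t
    return None


# Toy curves, invented purely for illustration: (hours, validation loss).
ref_times, ref_losses = [10, 20, 30, 40], [2.0, 1.5, 1.2, 1.0]
dist_times, dist_losses = [1, 2, 3, 4], [2.4, 1.9, 1.4, 1.0]

# Assumption: "99% of the best" is read as "within 1% of the best
# validation loss reached by the single-GPU reference run".
target = 1.01 * min(ref_losses)

t_ref = time_to_loss(ref_times, ref_losses, target)
t_dist = time_to_loss(dist_times, dist_losses, target)
if t_dist is None:
    print("distributed run never reached the target loss")
else:
    print("convergence speedup:", t_ref / t_dist)
```

Both runs are measured against the same target, so a run with great throughput that stalls at a worse loss shows up as "never reached" instead of as a speedup.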

If they can achieve 109x speed up with 128 GPUs using synchronous data parallelism with a batch size tuned for optimal single GPU convergence time, then this is very impressive (but quite unlikely).
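
For reference, the arithmetic behind that figure, as a quick sketch. The helper name is made up, and "scaling efficiency" is taken here to mean speedup divided by the number of GPUs.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# 109x on 128 GPUs, the numbers quoted in this thread.
print(f"{scaling_efficiency(109, 128):.1%}")  # prints 85.2%
```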

However I don't think that publishing training benchmarks on Inception v3 (vs say AlexNet) is a fraud. Inception v3 is close to the state of the art and very good at using few parameters & inference FLOPS for a good test accuracy.

Inception v3 has been publicly available for quite a long time in a variety of DL toolkits along with pre-trained weights.

The results are reported using synchronized SGD with each GPU using batch-size 32. More details such as scripts to reproduce the results, scalability results on various networks (including AlexNet) and various batch sizes will be available soon. I'll put more technical details such as implementation details and performance analysis in my PhD thesis.
Then the total batch size is growing with the number of GPUs and the convergence might be impacted both in terms of speed and solution quality (e.g. https://arxiv.org/abs/1609.04836 ).

I could believe you if you tell me that the validation loss and test accuracy of the large distributed model remain as good as those of the sequential, single-GPU model after the same total number of epochs, but this is not a given, and if it's not the case I would find those benchmarks deceptive.

There are a lot of papers on the trade-off between algorithm convergence (validation accuracy) and system efficiency. At least, it is my main PhD research topic at CMU. In the context of synchronized SGD on deep CNN models, my observation is that up to batch size X, the convergence speed is not very sensitive to the batch size; between X and Y, we still get a good convergence rate by tuning the hyper-parameters carefully; but beyond Y, it becomes an interesting research question.

Both X and Y depend on the dataset and network complexity. A rough guess I often use is num_classes < X < 10 * num_classes and Y ~= 10 * X. To accelerate convergence for batch sizes between X and Y, we can increase the data augmentation, the learning rate, or both. The basic idea is to add more noise to the SGD training to avoid falling into suboptimal points too easily.
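
Spelled out as code, the rough guess above gives ranges like the following. This is only a sketch of the rule of thumb as stated: the function name is hypothetical, and the two class counts are just example inputs.

```python
def batch_size_regimes(num_classes):
    """Ranges for X and Y under the rule of thumb
    num_classes < X < 10 * num_classes and Y ~= 10 * X.

    Returns ((x_low, x_high), (y_low, y_high)); the bounds are rough
    guesses, not hard thresholds.
    """
    if num_classes <= 0:
        raise ValueError("num_classes must be positive")
    x_low, x_high = num_classes, 10 * num_classes
    y_low, y_high = 10 * x_low, 10 * x_high
    return (x_low, x_high), (y_low, y_high)


for name, num_classes in [("10-class dataset", 10), ("1000-class dataset", 1000)]:
    x_range, y_range = batch_size_regimes(num_classes)
    print(f"{name}: X in {x_range}, Y in {y_range}")
```

Because X is itself a range, this only brackets where the regimes change. It cannot say which regime a particular batch size falls into.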

The paper you mentioned studies the extreme case where batch size >> Y. They used CIFAR 10 (num_classes = 10) and batch size (20% of num_examples = 12K). I was also surprised that they extended our earlier work to CNNs and showed promising results (Sec 4.2).

But, as the paper's authors also mention, there is little theory to explain this yet. I expect the research community will have fun with it for a while.

But back to the MXNet benchmark: we did successfully tune the hyper-parameters with 128 GPUs and batch size = 32 * 128 to match the convergence of a single machine on the ImageNet 1K dataset. So we think our setting is reasonable. But the main point here is that we are more interested in showing how fast the system can run, so that researchers can more easily try more efficient distributed algorithms on it.
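
To put numbers on that setting, here is a small sketch of the global batch size and the resulting number of synchronous updates per epoch. The helper names are made up, and the training-set size of 1,281,167 images for ImageNet 1K is an assumption added here, not a figure from the benchmark.

```python
def global_batch_size(per_gpu_batch, num_gpus):
    """Effective batch size of one synchronized SGD step."""
    return per_gpu_batch * num_gpus


def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one pass over the data (ceiling division)."""
    return (num_examples + batch_size - 1) // batch_size


IMAGENET_1K_TRAIN = 1_281_167  # assumed training-set size

single = global_batch_size(32, 1)
distributed = global_batch_size(32, 128)

print("global batch size:", distributed)  # 4096
print("updates/epoch, 1 GPU:   ", updates_per_epoch(IMAGENET_1K_TRAIN, single))
print("updates/epoch, 128 GPUs:", updates_per_epoch(IMAGENET_1K_TRAIN, distributed))
```

For the same number of epochs, the 128-GPU run takes far fewer (much larger) steps, which is why the hyper-parameters have to be retuned to match single-machine convergence.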

Are you claiming MXNet gets a ~109x speedup in training time to X% accuracy with synchronized SGD on 128 GPUs?
Amazon probably used P2 because they want to advertise it. We can get almost linear speedup on 10 8xM40 machines using MXNet. Batch size is increased linearly with the number of machines, but empirically it doesn't hurt convergence, at least on ImageNet.

I mean, who cares about AlexNet any more? It's 2016 already. It trains in under 2h on a single machine. Distributing it doesn't make much sense.

Publish those numbers with the sample code to reproduce them. Your first paragraph is enough for an awesome white paper/use case to drive adoption. Don't let silly AWS internal politics get in the way if you work there. Find a workaround.

Amazon is at its best when it's customer obsessed and at its worst when it puts politics first.

All IMO of course.

2 hours to train AlexNet on a single machine? Link please.
https://developer.nvidia.com/cudnn Alex did it on 2x580 in 2012. Took him 1 week. It's 60x faster now even compared to K40
Comparisons on AlexNet are not very useful now. I can get AlexNet-like quality a lot cheaper (at test time) now, and for the same computational cost I can get a lot better in quality of results ... or even better if I accept more cost. I can't think of a good reason to evaluate AlexNet nowadays, I'm more annoyed at the other guys(tm) that (exclusively) do, since that means to get meaningful datapoints I need to rerun the experiments myself.
AlexNet #s IMO provide an excellent ballpark estimate of how well balanced compute and communication are in terms of both the framework and the underlying platform.

A platform that runs AlexNet well has excellent computation performance for the convolution layers but it also has excellent algorithms/communication for parallelizing the model/data by whatever means.

Networks that attempt to minimize computation and/or communication are cool, but they should be considered in that light IMO.

It's also a great estimate of the low-end for strong scaling. There's a lot of bread and butter machine learning at this level in my experience.