by ogrisel
3500 days ago
What is really fishy is measuring training-time speedups in terms of throughput. The latency induced by the parallelism mechanism (when using asynchronous data parallelism) might seriously hamper convergence speed, and a throughput metric cannot detect this potential problem. They should have used a convergence metric instead (e.g. training time to reach 99% of the best validation loss). If they can achieve a 109x speedup with 128 GPUs using synchronous data parallelism with a batch size tuned for optimal single-GPU convergence time, then this is very impressive (but quite unlikely). However, I don't think that publishing training benchmarks on Inception v3 (vs., say, AlexNet) is a fraud. Inception v3 is close to the state of the art and very good at using few parameters and inference FLOPs for good test accuracy. It has also been publicly available for quite a long time in a variety of DL toolkits, along with pre-trained weights.
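
To make the convergence metric concrete, here is a rough sketch of how one could compute it from logged validation curves. Everything in it is an assumption for illustration: the function names, the reading of "99% of the best validation loss" as "within 1% of the best loss seen in the baseline run", and the numbers in the two toy runs, which are made up and not taken from any benchmark.

```python
def time_to_target(run, target_loss):
    """Return the first elapsed time at which validation loss <= target_loss.

    `run` is a list of (elapsed_seconds, validation_loss) pairs in
    chronological order. Returns None if the target is never reached.
    """
    for elapsed, loss in run:
        if loss <= target_loss:
            return elapsed
    return None


def convergence_speedup(baseline, parallel, tolerance=0.01):
    """Speedup measured as time-to-target rather than throughput.

    The target is the best baseline validation loss plus `tolerance`
    (relative). This assumes a positive loss where lower is better.
    """
    best = min(loss for _, loss in baseline)
    target = best * (1.0 + tolerance)
    t_base = time_to_target(baseline, target)
    t_par = time_to_target(parallel, target)
    if t_par is None:
        return None  # the parallel run never converged to the target
    return t_base / t_par


# Made-up curves: (elapsed seconds, validation loss).
baseline_run = [(100, 2.0), (200, 1.0), (300, 0.6), (400, 0.502), (500, 0.5)]
parallel_run = [(50, 2.2), (100, 1.4), (150, 0.9), (200, 0.504), (250, 0.5)]

print(convergence_speedup(baseline_run, parallel_run))
```

A run could process many times more images per second and still score poorly here if the extra throughput does not translate into reaching the target loss sooner, which is exactly what a throughput number hides.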