Hi, author here. The motivation for this article came out of the HN discussion on a previous post (https://news.ycombinator.com/item?id=16447096). There was a lot of valuable feedback - thanks for that.
Don't TPUs get sustained use discounts? I know they're not preemptible. That would be comparable to AWS reserved instances.
EDIT: you don't get sustained use discounts, either, at the moment. You can get either for GCP GPUs, though. Perhaps that will change once TPUs are out of beta?
"As shown above, the top-1 accuracy after 90 epochs for the TPU implementation is 0.7% better. This may seem minor, but making improvements at this already very high level is extremely difficult and, depending on the application, such small improvements may make a big difference in the end."
Any idea of how much variation in accuracy you get on different training runs of the same model on the same hardware? My understanding is that model quality can and does vary from one run to the next on these kinds of large datasets - from a single observation, it's hard to know if the difference is real or noise.
I've been running a lot of these resnet-50 experiments lately and the run-to-run variation is very small, on the order of 0.1%. It's actually pretty amazing how consistent training is given that the initialization is always different and the data is sampled differently on each run. (As an aside, it took us about three weeks to track down a bug that was causing the model to consistently reach an accuracy 1% lower than it was supposed to.)
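To make the noise-versus-signal question concrete, here is a minimal sketch that expresses an accuracy gap in units of the run-to-run standard deviation. The accuracy values are hypothetical placeholders, chosen only so the spread is roughly the 0.1% mentioned above; they are not measurements from the article, and `gap_in_std_units` is an illustrative helper, not anything from the original benchmark.

```python
import statistics


def gap_in_std_units(baseline_runs, candidate_acc):
    """Return how many sample standard deviations candidate_acc
    lies from the mean of baseline_runs."""
    mean = statistics.mean(baseline_runs)
    std = statistics.stdev(baseline_runs)  # sample std dev, needs >= 2 runs
    return (candidate_acc - mean) / std


# Hypothetical top-1 accuracies (in %) from five repeated runs of one setup.
baseline_runs = [76.0, 76.1, 75.9, 76.0, 76.1]

# A hypothetical result that is 0.7 points above the baseline mean.
candidate = statistics.mean(baseline_runs) + 0.7

z = gap_in_std_units(baseline_runs, candidate)
print(f"run-to-run std: {statistics.stdev(baseline_runs):.3f} points")
print(f"gap in std units: {z:.1f}")
```

With a spread this small, a 0.7-point gap sits many standard deviations out, so it is unlikely to be pure noise. It still rests on a single run per platform, so repeating the runs would be the stronger check.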
Indeed, that's also my experience. ImageNet is pretty huge (although 'it's the new MNIST'), so that seems to help training converge to very similar solutions and accuracies.
Tracking down bugs in convergence is really costly in these settings. We had a problem in pre-processing that took us quite a while to figure out...
Their hardware is fine. Their software is starting to get good too now. They're finishing MIOpen, a set of CUDA-compatible libraries with which you can use TensorFlow (TF uses the built-in CUDA libs too, not only CUDA itself, as does CNTK). ROCm provides a CUDA implementation for AMD systems.
I am not an ML guy, so I'm asking from a position of ignorance. (-:
But what's going on when some of the implementations of a standard algorithm don't converge, and different hardware has different accuracy rates on the same algorithm? Are DNNs really that flaky? And does it really make sense to be doing performance comparisons when the accuracy performance doesn't match?
Is the root problem that ResNet-50 works best with a smaller batch size?
And how do you do meaningful research into new DNNs if there's always a "Maybe if I ran it again over there, I'd get better results" factor?
Yeah, pretty big coincidence. However, this may change with the next TensorFlow versions, which supposedly have further speed improvements for the TPUv2.
Note also that the ~2% performance difference is only on one model (ResNet-50) and cannot be generalized to all workloads/all of deep learning (at least not without further proof).
In general, you try to keep the TPU/GPU 100% busy, so enough data needs to be readily accessible at any point in time. In this example, images need to be read from disk, decoded, and transformed (cropped, resized, normalized, etc.) before they can be fed to the TPU. The transformations can be computationally intensive, so they actually become a bottleneck.
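A toy sketch of that idea: a background thread preprocesses items ahead of time into a bounded buffer, so the consumer (standing in for the accelerator) rarely has to wait. Everything here is illustrative. `preprocess` is a hypothetical stand-in for decode/crop/resize/normalize, and the input is a list of fake pixel values rather than real images. A real pipeline would use the framework's own input tooling and several worker processes, since one Python thread does not give CPU parallelism.

```python
import queue
import threading


def preprocess(raw):
    """Hypothetical stand-in for decode + crop + resize + normalize."""
    return [x / 255.0 for x in raw]


def prefetching_items(raw_items, buffer_size=4):
    """Yield preprocessed items while a background thread works ahead,
    keeping at most buffer_size finished items waiting."""
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for raw in raw_items:
            buf.put(preprocess(raw))  # blocks while the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()  # blocks only if preprocessing has fallen behind
        if item is sentinel:
            return
        yield item


# Fake "images": short lists of pixel values in 0..255.
raw_images = [[0, 128, 255], [255, 255, 0], [10, 20, 30]]
processed = list(prefetching_items(raw_images))
print(processed)
```

If the consumer often blocks on `buf.get()`, preprocessing is the bottleneck. If the producer often blocks on `buf.put()`, the accelerator is.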
In terms of how much compute power the TPU pre-processing needs, I only have very rough numbers: I ran the same pre-processing while training ResNet-50 on a node with 4 GPUs, and it was consistently utilizing >22 CPU cores (including all of the other CPU tasks while training).
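For a rough sense of where a core count like that comes from, here is a back-of-envelope sketch. The throughput and per-image CPU cost below are hypothetical placeholders, not measurements from this thread, and `cores_needed` is an illustrative helper. The only point is that the required cores scale as images per second times CPU time per image.

```python
import math


def cores_needed(images_per_sec, cpu_ms_per_image):
    """Estimate the CPU cores needed to preprocess images_per_sec images,
    assuming each image costs cpu_ms_per_image milliseconds of one core
    and the work parallelizes perfectly."""
    return math.ceil(images_per_sec * cpu_ms_per_image / 1000)


# Hypothetical numbers: 2000 images/s consumed by the accelerators,
# 10 ms of CPU time per image for decode + augmentation.
print(cores_needed(2000, 10))
```

Real pipelines also spend CPU on I/O, batching, and host-to-device copies, so an estimate like this is a lower bound.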