Sure. The $2.48/hour per V100 GPU on GCP does not include the price of the CPU host; that is purely the price to rent a single accelerator. By contrast, a network-attached Cloud TPU v3 device includes both a CPU host and four connected TPU v3 chips that collectively deliver up to 420 teraflops. Furthermore, each individual V100 GPU on GCP has 16 GB of memory, whereas the Cloud TPU v3 device has 128 GB of HBM.
The best apples-to-apples performance-per-dollar comparison we have publicly available was published last fall, and it compared the performance and cost of using various Cloud TPU v2 Pod slice sizes with the performance and cost of using various numbers of V100 GPUs attached to a single GCP host:
https://cloud.google.com/blog/products/ai-machine-learning/n...
We went to great lengths to ensure that we trained exactly the same version of ResNet-50 to the same accuracy in the same way across all hardware configurations. The methodology predated MLPerf and is documented in full here:
https://github.com/tensorflow/tpu/blob/master/benchmarks/Res...
If you were going to do a similar performance-per-dollar comparison today, the simplest approach might be to try to get the code from NVIDIA's MLPerf 0.6 submissions running at scale on one or more major public clouds using the fastest-available networking technology that each cloud provides:
https://github.com/mlperf/training_results_v0.6/tree/master/...
It would be very interesting to see how distributed training performance on large-scale GPU clusters in public clouds compares with the published on-premises MLPerf performance numbers when exactly the same MLPerf code and methodology are used. With these measurements in hand, it would then be straightforward to make performance-per-dollar comparisons with Cloud TPU v3 Pod slices of various sizes.
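To make the arithmetic concrete, here is a minimal Python sketch of one way such a performance-per-dollar comparison could be computed once the measurements exist. Only the $2.48/hour V100 figure comes from the numbers above; the host price, the TPU device price, and both training times are made-up placeholders, and `TrainingRun` is a hypothetical helper rather than part of any benchmark code.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    """One hardware configuration trained to the same target accuracy."""
    name: str
    accelerator_price_per_hour: float  # USD per accelerator (or per device)
    num_accelerators: int
    host_price_per_hour: float         # USD; 0.0 when the host is bundled
    hours_to_target_accuracy: float    # must be measured, not estimated

    def hourly_cost(self) -> float:
        # Accelerator rental plus the CPU host, if it is billed separately.
        return (self.accelerator_price_per_hour * self.num_accelerators
                + self.host_price_per_hour)

    def cost_to_train(self) -> float:
        # Total USD spent to reach the target accuracy once.
        return self.hourly_cost() * self.hours_to_target_accuracy


if __name__ == "__main__":
    runs = [
        # 2.48 is the per-V100 price quoted above; the host price and the
        # training time are PLACEHOLDERS, not measurements.
        TrainingRun("8x V100 on one host", 2.48, 8, 1.00, 2.0),
        # The device price and training time are PLACEHOLDERS; the host is
        # included in the device price, so it is set to 0.0 here.
        TrainingRun("1x Cloud TPU v3 device", 8.00, 1, 0.0, 3.0),
    ]
    for run in sorted(runs, key=TrainingRun.cost_to_train):
        print(f"{run.name}: ${run.hourly_cost():.2f}/hour, "
              f"${run.cost_to_train():.2f} to train")
```

The comparison is only meaningful if every run trains the same model to the same accuracy, which is why the time field is time-to-accuracy rather than raw throughput.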