Comparing Google’s TPUv2 against Nvidia’s V100 on ResNet-50 (blog.riseml.com)
171 points by henningpeters 2977 days ago
14 comments

Thanks for sharing and very insightful. Guess the TPUs are the real deal. About 1/2 the cost for similar performance.

Would assume Google is able to do that because of the lower power required.

I am actually more curious to get a paper on the new speech NN Google is using. It is supposed to push 16k samples a second through a NN. It is hard to imagine how they did that and were able to roll it out, as you would think the cost would be prohibitive.

You are ultimately competing with a much less compute heavy solution.

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

Suspect this was only possible because of the TPUs.

Can't think of anything else where controlling the entire stack including the silicon would be more important than AI applications.

Half the cost? Where are you reading that? Yeah on demand rental in AWS is expensive, but both long term and buying V100 yourself is significantly cheaper. Cloud companies have pretty fat margins on on demand rentals.

You can’t buy a TPU, it’s a cloud only thing. They also show it’s not a huge difference in both perf and time to converge (albeit only one architecture)

I would say kudos to V100 and this benchmark that breaks the TPU hype.

The chart has 6.7 per hour for 3186 images Google and 12.2 per hour for 3128 AWS.

Or maybe reading it wrong?

That is close to half as much to use Google, is it not?
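
A quick way to check the arithmetic in this comment: a minimal sketch, assuming the hourly prices and throughputs quoted above ($6.7/h at 3186 images/s for the Cloud TPU, $12.2/h at 3128 images/s for the AWS V100 instance). The function name `images_per_dollar` is made up for illustration. Because the price is per hour and the throughput is per second, the throughput is scaled by 3600 first.

```python
SECONDS_PER_HOUR = 3600


def images_per_dollar(images_per_second: float, dollars_per_hour: float) -> float:
    """Images processed per dollar, given throughput (per second) and price (per hour)."""
    return images_per_second * SECONDS_PER_HOUR / dollars_per_hour


# Figures as quoted in the comment above (read off the article's chart).
tpu = images_per_dollar(3186, 6.7)
gpu = images_per_dollar(3128, 12.2)
ratio = tpu / gpu

print(f"Cloud TPU:   {tpu:,.0f} images/$")
print(f"V100 on AWS: {gpu:,.0f} images/$")
print(f"TPU advantage: {ratio:.2f}x")
```

With these inputs the TPU comes out at roughly 1.85x the images per dollar, so at on-demand prices the GPU instance costs a bit less than twice as much per image.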

BTW, the TPUs are also about twice as fast.

Sounds like Google is pretty far ahead of Nvidia. Which really just makes sense, as Google does the entire stack and is just going to have the data to optimize the silicon.

About half the cost is hype?

I want in the cloud and not have to deal with updating, etc. Would think most are the same for anything of any scale. Could not imagine any longer building up rigs and dealing with all the issues. Plus much harder to scale.

It's more a comparison of AWS vs. Google Cloud pricing than Nvidia vs. TPUv2.
Strongly disagree. If Google is able to offer at about 1/2 the cost using their own silicon versus AWS using Nvidia that is all about the silicon difference.

But we also have the V1 TPU paper and can see the TPUs are able to use less joules per inference compared to an older Nvidia architecture. Was not that close. Just makes sense Google V2 TPUs would do the same.

Hope Google does a V3 TPU and then will share a V2 TPU paper like they did on V1 of the TPUs.

What is far more impressive about the TPUs is

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

If it is really doing 16k samples a second through a NN, at a price you can offer generally, that is incredible. I want this paper even more.

What makes you so sure it is all the silicon difference and not just AWS pricing their product at a more profitable price point?

These costs also ignore transferring and storing massive data sets in the cloud. In general the cloud is a huge pain and I'd avoid it like the plague unless I was caught and really, really needed the scalability. But even then that only works if you have a scalable implementation of the algorithm you are working on.

Maybe, maybe not. They have the advantage that they make the hardware, so they're not paying as much retail as nvidia is charging them for their cards. I don't think there's any way you can say the TPU is cheaper compared to buying your own system. If Google decides to release it to the public, that's a different story. Also, keep in mind that Google allows you to mix and match the CPU core count to GPU, whereas AWS doesn't. It's possible that the Google cloud price with fewer CPU cores will be much cheaper than the AWS instance.
If anything, the pricing likely benefits Google. As in Google may be more profitable with the TPU usage, even at 1/2 the cost of Amazon's V100 usage.
fwiw, the "TPU instance " has more than one tpu chip on it.
The architectures are so radically different that I don't think it makes sense to try to compare anything but the whole system performance. Trying to do a 1 to 1 comparison for a core or a chip becomes pretty nebulous because the architectures are radically different.
It has more than the chips, too, since the TPUs can't run a TCP/IP stack, gRPC server, etc.
See the chart titled: Performance in images per second per $.

TPUv2 has 1.27x-1.86x the images/s/$.

And the other chart titled: Cost to reach 75.7% top-1 accuracy.

Where TPUv2 costs 62.5% of the reserved GPU instance cost and 42.6% of the unreserved GPU cost.

Key takeaway from the article:

> While the V100s perform similarly fast, the higher price and slower convergence of the implementation results in a considerably higher cost-to-solution.

The impression I got was opposite: TPU is not the hot shit that Google claims it is. Pricing is kind of irrelevant since they can subsidize this to create that story.
I know an engineer who prototypes GPU-like systems with FPGA and he has told me to be skeptical about performance miracles.

No matter how fast a system is on the inside you have to get data in and out of it -- at the very least to memory. SRAM takes too much area and there is a limit to DRAM bandwidth despite technologies such as eDRAM and HBM. Some tasks are compute intensive, but for general tasks, a processor that is 100x faster would need 100x faster memory to really be 100x faster.

Thus advances in real-life performance are likely to be more like a factor of 2.
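
The memory argument above can be sketched with a roofline-style bound: attainable throughput is the smaller of the compute ceiling and memory bandwidth times arithmetic intensity (FLOPs per byte moved). This is only an illustration; the machine numbers below are hypothetical and do not describe any real chip, and the helper names are invented.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(compute ceiling, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * intensity)


# Hypothetical baseline machine, chosen only to make the numbers easy to read.
BASE_PEAK = 1e12   # 1 TFLOP/s of compute
BANDWIDTH = 1e11   # 100 GB/s of memory bandwidth (unchanged after the upgrade)


def speedup(intensity: float, compute_factor: float = 100.0) -> float:
    """Real speedup from making compute `compute_factor` times faster, same memory."""
    before = attainable_flops(BASE_PEAK, BANDWIDTH, intensity)
    after = attainable_flops(BASE_PEAK * compute_factor, BANDWIDTH, intensity)
    return after / before


for intensity in (1, 20, 1000):
    print(f"{intensity:>5} FLOPs/byte -> {speedup(intensity):.0f}x")
```

In this toy model the 100x faster processor only delivers 100x on workloads with very high arithmetic intensity; a workload at 20 FLOPs per byte gets 2x, and a memory-bound one gets nothing.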

For training I never pay full price in the AWS cloud; rather, I run interruptible instances and pay a fraction of the list price. People I know who train in the Google cloud seem to get interrupted all the time even though they are paying full price.

Inference is another story. Once you have the trained model, you will usually need to run inference many many more times than you run training and this gets more so the bigger scale you are running at. That hits your unit costs and it is where you need to pinch every penny.

> Pricing is kind of irrelevant since they can subsidize this to create that story.

Depends on how much you plan to use the hardware. If it's running near continuously, total cost of ownership is very important. Power costs can quickly dominate TCO.

At the pricing extreme, Google could make their TPUs free to use and charge elsewhere in their cloud. This shows that literal pricing is pretty irrelevant.
So could AWS/Nvidia.
AWS yes. Nvidia, not so sure. When you buy a 1080ti you are competing with gamers and miners (and maybe others). There's nothing to subsidize, in fact those cards are selling above MSRP, because they aren't selling an ecosystem but a physical card.
Did you get that impression from this line in the article?

> While the V100s perform similarly fast, the higher price and slower convergence of the implementation results in a considerably higher cost-to-solution.

Full disclosure, I currently work at Nvidia on speech synthesis.

You can definitely do this on a GPU. We use the older auto-regressive WaveNets (not Parallel Wavenet) for inference on GPUs, with the newly released nv-wavenet code. Here's a link to a blog post about it:

https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis

That code will generate audio samples at 48khz, or if you're worried about throughput, it'll do a batch of 320 parallel utterances at 16khz.

> About 1/2 the cost for similar performance.

I would expect a dedicated accelerator to need at least a 5-10X advantage to outweigh all the other infrastructure and ecosystem costs.

GPUs are more useful for a wide variety of data-parallel tasks, and many more NN frameworks work on top of CUDA than work on the TPU.

In terms of horizontal scalability, nvidia has been rapidly iterating on increasing both memory and interlink bandwidth (including NVSwitch [1]), while each 'TPU' is actually 4 chips interconnected so likely has less upward scalability.

Also note that the tensor cores on a V100 take roughly 25-30% of the actual area. If Nvidia wanted to, they could probably easily make a pure tensor chip that beat the TPU in performance, could be produced in volume on their existing process, and also had full compatibility with their entire stack.

All in all, a 2x price/performance advantage for a hyper-specialized accelerator is basically a loss, just like how nobody installs a Soundblaster card anymore, how consumer desktops don't run discrete GPUs even though integrated graphics are a few times slower, or

[1] https://www.nextplatform.com/2018/04/04/inside-nvidias-nvswi...

If that 2x price/performance scales for all of Google's inferencing then it is definitely not a loss for them. If they can halve their running costs for inferencing then they are saving themselves a ton of money. Their TPUv2 was announced slightly before the V100 and the money savings they make by not paying Nvidia premiums probably helps. From the customer point of view, what is a GPU other than a specialised accelerator. Without more details we can't know how a TPU really compares, but if your aim is to train/run inference of Tensorflow models, then they're a really competitive product at the moment.
I agree, but chip development is an expensive business. There is nothing preventing Nvidia from immediately turning around and building a specialised ML accelerator with better software integration and higher bandwidth. For all we know they could already be working on one.
They already did two generations. Google has over $100B in the bank with less than $4B debt. So money is not an issue. It is tiny in the scheme of things.

Google has an advantage as they do the entire stack and can better optimize like we see here with half the cost.

Nvidia is actively building an entire deep learning stack internally, all the way to releasing a self-driving simulation platform which they are using to build their own self-driving software [1].

I think they are actually farther along and more aggressive about exploring deep learning use cases in production than Google today; augmenting real data with extensive simulation is really a far-reaching idea that comes directly from their gaming experience.

> So money is not an issue. It is tiny in the scheme of things.

Money of course is always an issue long term; otherwise why doesn't Google Fiber just spend tens of billions of dollars to build out its nationwide network? Because it will see negative ROI even if they succeed.

The TPU has to eventually make a real return to Google, and it won't if nvidia can spend the same amount of money and build a faster product and sell it to all the other cloud players, which I believe they definitely can.

Put another way, the TPU has to be cheaper to Google than buying nvidia GPUs after factoring in its development costs, whereas nvidia gets to amortize those dev costs over all other cloud providers and all other GPU customers. Google isn't about to sell the TPU to other cloud providers; the entire idea is to use it to drive Google Cloud adoption.

The TPU is a fine chip, but if you just look at the big picture, there is every sign that nvidia could build the same or better product for less money because it has far more synergies across the hardware and chip design stack; e.g. the TPU only has PCIe connectors, while nvidia has already worked with IBM to get NVLink into supercomputers [2]. For some workloads the TPU will likely be bandwidth-starved communicating with the CPU and main memory.

[1] https://nvidianews.nvidia.com/news/nvidia-introduces-drive-c...

[2] https://www.ibm.com/us-en/marketplace/power-systems-ac922/de...

Hi, author here. The motivation for this article came out of the HN discussion on a previous post (https://news.ycombinator.com/item?id=16447096). There was a lot of valuable feedback - thanks for that.

Happy to answer questions!

Don't TPUs get sustained use discounts? I know they're not preemptible. That would be comparable to AWS reserved instances.

EDIT: you don't get sustained use discounts, either, at the moment. You can get either for GCP GPUs, though. Perhaps that will change once TPUs are out of beta?

"As shown above, the top-1 accuracy after 90 epochs for the TPU implementation is 0.7% better. This may seem minor, but making improvements at this already very high level is extremely difficult and, depending on the application, such small improvements may make a big difference in the end."

Any idea of how much variation in accuracy you get on different training runs of the same model on the same hardware? My understanding is that model quality can and does vary from one run to the next on these kinds of large datasets - from a single observation, it's hard to know if the difference is real or noise.

I've been running a lot of these resnet-50 experiments lately and the run-to-run variation is very small, on the order of 0.1%. It's actually pretty amazing how consistent training is given that the initialization is always different and the data is sampled differently on each run. (As an aside, it took us about three weeks to track down a bug that was causing the model to consistently reach an accuracy 1% lower than it was supposed to.)
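
One rough way to judge whether a 0.7% gap is signal or noise, given run-to-run variation on the order of 0.1%: a minimal sketch, assuming independent runs and roughly normal noise. The accuracy values in `runs` are hypothetical placeholders, not measurements from the article or from this comment.

```python
import statistics

# Hypothetical top-1 accuracies (%) from repeated runs of one configuration.
runs = [76.3, 76.4, 76.2, 76.4, 76.3]

# Gap between the two implementations, as quoted from the article.
observed_gap = 0.7

# Sample standard deviation of a single run.
spread = statistics.stdev(runs)

# The difference of two independent runs has sqrt(2) times that spread.
gap_in_sigmas = observed_gap / (spread * 2 ** 0.5)

print(f"run-to-run std: {spread:.3f} points")
print(f"observed gap is about {gap_in_sigmas:.1f} standard deviations")
```

With a spread this small, a 0.7 point gap sits several standard deviations out, which is why a single pair of runs can still be informative here. With noisier training the same gap would need repeated runs on both sides.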
Indeed, that's also my experience. ImageNet is pretty huge (although 'it's the new MNIST') so that seems to help converging to very similar solutions and accuracies.

Tracking down bugs in convergence is really costly in these settings. We had a problem in pre-processing that took us quite a while to figure out...

AMD - Where does their hardware stand in the race for ML? What changes would AMD need to make to be competitive?
Their hardware is fine. Their software is starting to get good too now. They're finishing MIOpen, a set of CUDA compatible libraries with which you can use Tensorflow (TF uses the builtin CUDA libs too, not only CUDA itself, as does CNTK). ROCm provides a CUDA implementation for AMD systems.
Their hardware doesn't have the equivalent of a tensor core as far as I know, so they would be way behind on these benchmarks.
Nice work. I've only seen anecdotal stories about how TPU is faster, but never something as detailed as this.
I am not an ML guy, so I'm asking from a position of ignorance. (-:

But what's going on when some of the implementations of a standard algorithm don't converge, and different hardware has different accuracy rates on the same algorithm? Are DNNs really that flaky? And does it really make sense to be doing performance comparisons when the accuracy performance doesn't match?

Is the root problem that ResNet-50 works best with a smaller batch size?

And how do you do meaningful research into new DNNs if there's always an "Maybe if I ran it again over there I'd get better results" factor?

Thank you.

I found it interesting that they are so close together in performance - I mean what are the odds that they end up within 2% of each other?
The TPUs are doing almost 2x the images for the same cost.

That is not all that close is it?

Yeah, pretty big coincidence. However, this may change with the next TensorFlow versions, which supposedly have further speed improvements for the TPUv2.

Note also, that the ~2% performance difference is only on one model (ResNet-50) and cannot be generalized to all workloads/all of deep learning (at least not without further proof).

Do you have more information about this bit?

> the TPU implementation applies very compute-intensive image pre-processing steps and actually sacrifices raw throughput

Thanks

In general, you try to keep the TPU/GPU busy 100% of the time, so enough data needs to be readily accessible at any point in time. In this example, images need to be read from disk, decoded, and transformed (cropped, resized, normalized, etc.) before they can be fed to the TPU. The transformations can be computationally intensive so they actually become a bottleneck.

In terms of how much compute power the TPU pre-processing needs I only have very rough numbers: I ran the same pre-processing while training ResNet-50 on a node with 4 GPUs and it was consistently utilizing >22 CPU cores (including all of the other CPU-tasks while training).
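
The idea of keeping the accelerator fed can be sketched as a producer/consumer pipeline with a bounded prefetch buffer. This is a generic illustration using only the Python standard library, not the input pipeline used in the benchmark; `preprocess`, `run_pipeline`, and the buffer size are hypothetical stand-ins.

```python
import queue
import threading


def preprocess(item: int) -> int:
    """Stand-in for decode/crop/resize/normalize (hypothetical work)."""
    return item * 2


def producer(items, out_q: queue.Queue, sentinel: object) -> None:
    """Preprocess items on a background thread and push them into the buffer."""
    for item in items:
        out_q.put(preprocess(item))  # blocks while the buffer is full
    out_q.put(sentinel)              # signal that no more data is coming


def run_pipeline(items, buffer_size: int = 4) -> list:
    """Overlap host-side preprocessing with the consumer via a bounded queue."""
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()
    worker = threading.Thread(target=producer, args=(items, buf, sentinel), daemon=True)
    worker.start()

    consumed = []
    while True:
        batch = buf.get()            # blocks only if preprocessing falls behind
        if batch is sentinel:
            break
        consumed.append(batch)       # stand-in for one accelerator training step
    worker.join()
    return consumed


result = run_pipeline(range(8))
print(result)
```

If the consumer regularly finds the buffer empty, preprocessing is the bottleneck, and more CPU cores (or cheaper transformations) are needed rather than a faster accelerator.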

What about your LSTM-based model that didn’t converge in your earlier TPU benchmarks in February?
Slower alternative: "fastai with @pytorch on @awscloud is currently the fastest to train Imagenet on GPU, fastest on a single machine (faster than Intel-caffe on 64 machines!), and fastest on public infrastructure (faster than @TensorFlow on a TPU!) Big thanks to our students that helped with this." - https://twitter.com/jeremyphoward/status/988852083796291584
One machine with 8 V100 GPUs. If you consider one TPU pod a single machine the TPU is faster. Those numbers also show that 8 GPUs are slower than 8 TPUs (so same conclusion as the article)
An important hidden cost here is coding a model which can take advantage of mixed-precision training. It is not trivial: you have to empirically discover scaling factors for loss functions, at the very least.
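
To illustrate why a scaling factor is needed at all: a minimal sketch, using the standard library's half-precision packing to emulate fp16 rounding. Small gradients underflow to zero in fp16; multiplying the loss (and therefore, by the chain rule, every gradient) by a constant keeps them representable, and the result is divided by the same constant in higher precision afterwards. The gradient value and `LOSS_SCALE` below are hypothetical; in practice the factor has to be found empirically, as the comment says.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 65536.0   # hypothetical scaling factor (a power of two)
true_grad = 1e-8       # hypothetical small gradient value

# Without scaling, the gradient is below fp16's smallest subnormal and vanishes.
unscaled = to_fp16(true_grad)

# With scaling, the fp16 value survives; unscale it in higher precision.
scaled = to_fp16(true_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(f"fp16 without scaling: {unscaled}")
print(f"fp16 with scaling, then unscaled: {recovered:.3e}")
```

A scale that is too large has the opposite problem (large gradients overflow fp16), which is part of why picking it is not trivial.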

It's great that there is now wider choice of (pre-trained?) models formulated for mixed-precision training.

When I was comparing Titan V (~V100) and 1080ti 5 months ago, I was only able to get 90% increase in forward-pass speed for Titan V (same batch-size), even with mixed-precision. And that was for an attention-heavy model, where I expected Titan V to show its best. Admittedly, I was able to use almost double the batch-size on Titan V, when doing mixed-precision. And Titan V draws half the power of 1080ti too :)

At the end my conclusion was: I am not a researcher, I am a practitioner - I want to do transfer learning or just use existing pre-trained models - without tweaking them. For that, tensor cores give no benefit.

Author here.

Yes, thanks for mentioning that! That's what the article is alluding to at the end. There's also something like a "cost-to-model" and that's influenced by how easy it is to make efficient use of the performance and how much tweaking it needs. It's also influenced by the framework you use... However, that's difficult to compare and almost impossible to measure.

How did you get your hands on Titan V 5 months ago? I still can't find it anywhere in retail in EU...
It was in stock on and off and I was able to order it directly from Nvidia US.

After 59 days of playing with it, I sent it back (initiated return on 30th day, after I already figured out it doesn't live up to the hype, then had another 30 days to actually send it back).

With $3,000 I can buy 4 1080ti's, while only two are necessary to beat Titan V (in Titan V's best game). I only bought one though. NowInStock.net helped with buying 1080ti directly from Nvidia.

Nvidia is currently in a cashing-out phase. They have a monopoly and money flows in effortlessly. The cost performance ratio reflects this.

AMD will enter the game soon once they get their software working, Intel will follow.

I suspect that Nvidia will respond with its own specialized machine learning and inference chips to match the cost/performance ratio. As long as Nvidia can maintain high manufacturing volumes and small performance edge, they can still make good profits.

"The cost performance ratio reflects this."

But the TPUs are half the cost per this article?

Plus Google does the entire stack and can better optimize the hardware versus Nvidia. So it seems Google can improve faster, I would think.

If there ever was a huge advantage doing the entire stack it is with neural networks.

A perfect example is Google's new speech doing 16k samples a second with a NN.

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

Do not think Google could offer this service at a competitive cost without the TPUs.

This new method is replacing a method that was far less compute intensive, so offering it at a competitive price requires lowering compute cost, which I suspect is only possible with the TPUs.

> But the TPUs are half the cost per this article?

Exactly. Nvidia can match the performance already without a 100% specialized processor. It's just the price they need to cut by optimizing their architecture for tensor processing and reducing their profits when competition emerges.

Google is not in the business of becoming a major chip maker or competing with Nvidia head on. Putting hundreds of millions into a new microarchitecture every second year eats lots of resources. They just want a competitive market and the prices to go down.

I'm not sure what you mean by google does the entire stack. Nvidia writes all of the major CUDA libraries used behind the scenes in the NN libraries, such as cuDNN, cuBLAS, etc. Nvidia can likely improve their hardware significantly faster/more efficiently than Google can because their entire business depends on it. Google has incentive for improving their TPU for internal use, but they don't make any money by selling TPU time on GCP yet.
> I'm not sure what you mean by google does the entire stack.

Consider that Google has some of the best machine learning researchers, compiler engineers, hardware engineers, and infrastructure in the business working on this.

Huh? Machine learning and infrastructure Engineers, yes. Compiler and Hardware engineers? No. What gives you reason to believe they have a lead in either of those departments other than they have a lot of money? They're forced to use the same foundry as Nvidia, and their Hardware team is likely significantly smaller.
Google has been buying up AI resources well before anyone else and has the strongest and deepest team at this point.

It is why so many of the breakthroughs have come from Google. A great example is winning at Go almost a decade earlier than anyone thought possible.

They probably have two of the strongest teams: the Brain team and the DeepMind team. But all the other engineers and infrastructure are first rate at Google.

Really, at this point I do not think the $100B cash is as important, as Google already built the team and experienced resources are now far more difficult to get.

The other advantage for Google is their ability to attract the top engineers in addition.

Google just got started a lot earlier on all of this.

Google does the applications at scale and then each layer below and a big one is TF. A great example is the recent release of the new text to speech using NN.
When you use a Google service that uses the TPUs they are indirectly selling the TPUs.
>For GPUs, there are further interesting options to consider next to buying. For example, Cirrascale offers monthly rentals of a server with four V100 GPUs for around $7.5k (~$10.3 per hour). However, further benchmarks are required to allow a direct comparison since the hardware differs from that on AWS (type of CPU, memory, NVLink support etc.).
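
The monthly-to-hourly conversion in the quote can be checked with a short sketch, assuming an average month of 730 hours (365 * 24 / 12) and a server that is used around the clock. The function name is made up for illustration.

```python
HOURS_PER_MONTH = 365 * 24 / 12  # 730 hours in an average month (assumption)


def monthly_to_hourly(dollars_per_month: float) -> float:
    """Effective hourly price of a monthly rental at 100% utilization."""
    return dollars_per_month / HOURS_PER_MONTH


cirrascale_hourly = monthly_to_hourly(7500)
print(f"${cirrascale_hourly:.2f} per hour")
```

This gives about $10.27 per hour, consistent with the ~$10.3 in the quote. At lower utilization the effective hourly price rises proportionally, since the monthly fee is paid either way.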

Can't you just buy some 1080s for cheaper than this? I understand there are electricity and hosting costs, but cloud computing seems expensive compared to buying equipment.

Yes, you can. The problem starts when "you" are a large company -- NVidia restricts "datacenter" use of consumer GPUs (see previous HN discussion of that one: https://news.ycombinator.com/item?id=15983587 ). A single Titan V is somewhere in the 90% range of a V100 at less than 1/3 the cost, and a 1080ti, if you can find one, likely offers a slightly better price/performance spot. 4-GPU training may suffer due to the lack of NVlink, but not enough for it to matter too much. As you scale, though, the lack of NVlink will hurt more. And, of course, all of these things come with a capex vs opex tradeoff, and a sysadmin vs cloud tradeoff, that will appeal differently to different situations.
With a mining exception for some reason, and their drivers blocking themselves when running in a virtualized environment unless you do some hacks.
The new "datacenter" restriction only applies to GeForce branded cards. The Titan V is now called the "NVIDIA Titan V" and with no GeForce branding to be found anywhere.

So the restriction applies to the 1080ti but _not_ the titan V. I completely agree the restriction is total bullshit but it's important to get the facts straight.

Not according to the statement from NVidia quoted in this article: https://www.cnbc.com/2017/12/27/nvidia-limits-data-center-us...

It applies to both GeForce and Titan.

You're right - it seems like they have added "Titan" to the agreement since it was first posted on HN:

http://www.nvidia.com/content/DriverDownload-March2009/licen...

Thanks for the tip!

Hire people to buy 1080 in retail. This problem is solvable easily.
It's not about getting the cards (though supplies are limited because of cryptocurrency mining, but you could buy Titan V's off the shelf in batches of 2). It's about whether or not you're big enough of a target for Nvidia's lawyers if you violate the agreement and actually build a datacenter out with them.
It's hard to find 1080[ti]+ in retail. Whenever they become available they sell out pretty quickly.
Probably not the best phrasing in the post ("next to buying"). It's only comparing cloud pricing (since the TPUv2 is only available there). If you consider buying hardware the situation is different as you correctly point out.
1080s don't have the "tensor cores" of V100, or NVLink, so they will not get anywhere near the same performance on this benchmark.
Excellent! Thanks for these numbers, I wanted to see exactly this kind of benchmarks! Do you plan to try different benchmarks with the same setup for different problems, like semantic segmentation, DenseNet, LSTM training performance etc. as well?
Happy to hear the benchmark is useful to you! We'd love to try different setups and further models/networks. On the other hand, such benchmarks are a LOT of effort (which we underestimated initially), so we'll have to see.
Excellent work. Do you have plans to open source the scripts/implementation details used to reproduce the results? Would be great if others can also validate and repeat the experiment for future software updates (e.g. TensorFlow 1.8) as I expect there will be some performance gain for both TPU and GPU by CUDA and TensorFlow optimizations.

Sidenote: Love the illustrations that accompany most of your blog posts, are they drawn by an in-house artist/designer?

Happy you like the post! The implementations we used are open source (we reference the specific revisions), so reproducing results is possible right now. We haven't thought about publishing our small scripts around that (there's not much to it), but it's a good idea. There's also work towards benchmarking suites like DAWNBench (https://dawn.cs.stanford.edu/benchmark/).

The illustrations are from an artist/designer we contract from time to time. I agree, his work is awesome!

> The illustrations are from an artist/designer we contract from time to time. I agree, his work is awesome!

Kudos to them; they are awesome!

What they're not saying is that one can't use all nvlink bandwidth for gradient reduction on a DGX-1V with only 4 GPUs because nvlink is composed of 2 8-node rings. And given the data parallel nature of this benchmark, I'm very interested in where time was spent on each architecture.

That said, they fixed this on NVSwitch so it's just another HW hiccup like int8 was on Pascal.

For this benchmark, NVLink and gradient reduction aren't the bottleneck. The performance scales almost perfectly linearly from one GPU to four.
Thanks for this, just a minor thing:

You have price per hour and performance per second. Thus that ratio is not performance per image per $, you need to scale that. Also, the metric is not "images per second per $", but just "images per $".

Thanks for catching this!
How much detail do we know about the TPUs' design? Does Google disclose a block-diagram level? ISA details? Do they release a toolchain for low-level programming or only higher-level functions like TensorFlow?

EDIT: I found [1] which describes "tensor cores", "vector/matrix units" and HBM interfaces. The design sounds similar in concept to GPUs. Maybe they don't have or need interpolation hw or other GPU features?

[1] https://cloud.google.com/tpu/docs/system-architecture

Great paper on the Generation 1 TPU. But Google has not shared many details on gen 2 and in some ways kind of hid information.

Suspect we will need a gen 3 to get a paper on the gen 2.

Here is the gen 1 paper, which I highly recommend. Pretty interesting using 65536 very simple cores.

https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

So far only very few details are disclosed. Here are two presentations:

https://supercomputersfordl2017.github.io/Presentations/Imag... http://learningsys.org/nips17/assets/slides/dean-nips17.pdf

For the last version of the TPU, Google provided more detail, e.g., in this paper:

https://arxiv.org/pdf/1704.04760.pdf

Hopefully, Google will publish something similar for TPUv2, but I have no knowledge whether or when that might happen.

> Maybe they don't have or need interpolation hw or other GPU features?

Definitely, no need to do any kind of rasterization here.

Great work, RiseML. This benchmark is sincerely appreciated.

I wonder whether NVLink would make any difference for Resnet-50. Does anyone know whether these implementations require any inter-GPU communication?

They don't require it but some of the ResNet-50 implementations can make use of it (e.g., the ones in the Docker containers on the Nvidia GPU Cloud). But even the ones without seem to scale to 4 GPUs pretty well. This may be a different story for 8 GPUs and larger/deeper networks, e.g., ResNet-152.
Was this running the AWS Deep Learning AMI, or did you build your own?

Because Intel was involved in its development and made a number of tweaks to improve performance.

Be curious if it actually was significant or not.

On AWS this was using nvidia-docker with the TensorFlow Docker images. Probably, the AWS AMI Deep Learning gives very similar performance (with same versions of CUDA, TensorFlow etc.). There's only so much you can tweak if the GPU itself is the bottleneck...
>For the V100 experiments, we used a p3.8xlarge instance (Xeon E5–2686@2.30GHz 16 cores, 244 GB memory, Ubuntu 16.04) on AWS with four V100 GPUs (16 GB of memory each). For the TPU experiments, we used a small n1-standard-4 instance as host (Xeon@2.3GHz two cores, 15 GB memory, Debian 9) for which we provisioned a Cloud TPU (v2–8) consisting of four TPUv2 chips (16 GB of memory each).

A bit odd that the TPUs are provisioned on such a weaker machine compared to the V100s, especially when there were comparisons which included augmentation and other processing outside of the TPU.

All of the computation, including pre-processing, is offloaded to the TPU. The weak machine is really just idling. A bigger one will only cost money and have no measurable effect on the performance.
What is the cost difference between the CPUs on the google cloud vs AWS? How would adjusting for it affect the cost/images ratio?
This is why my previous comment mentioned that GCP is a better benchmark for this since you can select the number of CPUs to match with the GPUs to some extent. You can get a rough idea of the savings by looking at their P100 instances.
The TPU is not really just the chip. It has an actual machine that is provisioned behind the scenes and accepts RPC calls. Good luck finding out its specs. All you're supposed to care about are the address and port it answers at.