by jrk 3036 days ago
[Edited] The top-line results focus on comparing four TPUs in a rack node (which marketing cleverly named “one cloud TPU”), running ~16-bit mixed precision, to one GPU (out of 8 in a rack node), also capable of 16-bit or mixed precision, but handicapped to 32-bit IEEE 754. That is a misleading comparison. Images/$ is obviously more directly comparable, but again the emphasized comparisons are at different precisions. The very different batch sizes make this more misleading still. Images/$ also only tells us that Google has looked at the competition and set a competitive price; the per-die or per-package comparison is much more relevant for understanding any intrinsic architectural advantage, since these are all large dies on roughly comparable process nodes.

Disclosure: I work on Google Cloud.

Depends on your metric, Jonathan! If you focus on the per dollar numbers, then it’s actually net favorable to the V100, because a second GPU over NVLINK won’t be as cost-efficient. If what you care about is raw throughput “in a single box”, then 8xV100 probably comes out ahead here.

Like someone else below though, I worry about the “hey wait a minute, changing the batch size just for the TPU seems unfair” and the whole “the LSTM didn’t converge” bit. Not a bad first draft, but hopefully the authors can do some more comparisons.

Author here.

Thanks for your feedback. As I noted above, we will report further results with larger batch sizes (and smaller ones for the TPU). The LSTM not converging is one of the experiences we wanted to share. We are working on solving this issue and will update the post accordingly. Our goal is really a fair and valuable comparison, which is not easy, so we value all of the feedback.

That's why you scroll down the page to the cost comparison, which places it on a more even keel. They do also compare float16 on Volta. Physical packaging is irrelevant -- what matters is dollars to convergence and time to convergence.

(I'm obviously biased - I helped with parts of the cloud-side of cloud TPU - but I presume this comment stands on its own. :-)

To be clear, I had read the whole post, I was just being terse since the emphasis seemed to be so heavily on an apples to bananas comparison (I believe 100% of the results cited in the prose, many in bold, are with mismatched precision and batch size), with minimal articulation of the many axes of nuance here. Precision isn't defined at all in the LSTM case, and could easily be the cause of the failure of the TPU run to converge where the GPU runs do. To a non-expert audience I think the end result is confusing and misleading.

Also, while I certainly agree that the performance/dollar comparison is highly relevant to customers at a given instant, that may only tell us that Google is subsidizing this hardware now that they've deployed it, and/or that, lacking serious competition, NVIDIA has been building crazy margins into their P100/V100 prices. In understanding fundamental technological tradeoffs, and even the limits of what the pricing in a more competitive market could be, it is relevant to compare performance per unit of hardware resources (mm^2, die/package, watt, GB of HBM, etc.)

In short, these comparisons are hard, and there is no single one that tells a complete story. I pushed back because, while the post includes some nuance, it brushes a great deal under the rug and focuses primarily on problematic comparisons.

(Further disclosure: I'm at least the third person in this sub-thread with some Google Brain/Cloud affiliation. I am speaking in my independent academic voice. I also think TPUs are great, having them publicly available now is great, and competition and diversity of architectural approach in accelerators is great. I appreciate the effort of the authors, but think the subtlety of these comparisons requires serious care.)
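To make the point about normalization concrete, here is a minimal sketch of how the same two measurements can rank differently depending on the denominator. All numbers are hypothetical and chosen only for illustration; they are not taken from the benchmark post, and the names `Unit`, `per_chip`, and `images_per_dollar` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One rentable unit of hardware (hypothetical figures only)."""
    name: str
    images_per_sec: float    # throughput of the whole allocated unit
    chips: int               # dies/packages inside that unit
    dollars_per_hour: float  # rental price of the whole unit


def per_chip(u: Unit) -> float:
    """Throughput normalized per die/package."""
    return u.images_per_sec / u.chips


def images_per_dollar(u: Unit) -> float:
    """Images processed per dollar of rental cost."""
    return u.images_per_sec * 3600 / u.dollars_per_hour


# Hypothetical values, NOT measurements from any real device.
unit_a = Unit("A (4 chips)", images_per_sec=2000.0, chips=4, dollars_per_hour=8.0)
unit_b = Unit("B (1 chip)", images_per_sec=800.0, chips=1, dollars_per_hour=4.0)

for u in (unit_a, unit_b):
    print(f"{u.name}: {u.images_per_sec:.0f} img/s per unit, "
          f"{per_chip(u):.0f} img/s per chip, "
          f"{images_per_dollar(u):.0f} img/$")
```

With these made-up inputs, A wins per allocated unit and per dollar, while B wins per chip. That is why the choice of denominator needs to be stated alongside any headline number.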

(1) I think we should all collectively agree to ignore the LSTM results -- without the model converging, it's impossible to say whether or not the bug that prevented convergence also affected speed. I can build an arbitrarily fast LSTM that doesn't converge in a few print statements. :-)

(2) Unless Google releases the specs to the h/w, I'd argue that cost is our best proxy. But if you assume that both Google and Amazon want to make a profit on their cloud rentals, it at least gives us a way to get to something we can normalize to (the V100's list price is public, though who knows how much Amazon pays). And, given that you can't buy a Cloud TPU, the price Google charges really is the meaningful answer. It doesn't tell us about fundamentals, but it's the right answer from a consumer standpoint.

I think it's a fair bigger-picture question to ask how we fairly and informatively benchmark cloud-only services in ways that let us not only get consumer-oriented price comparisons, but also learn from the underlying technical choices. The longer-term answer is that we beg Google to write a paper about TPUv2, as they (surprisingly!) did about TPUv1 -- because without that, we just get black box numbers combined with informed speculation based upon glossy board and heatsink photos.

btw - the best current source of specs about TPUv2 was Jeff's NIPS talks: http://learningsys.org/nips17/assets/slides/dean-nips17.pdf

It mentions a few details, like 16GB of HBM per chip with 600GB/s of memory bandwidth.

(3) I agree completely with you that the comparisons are hard. I'm very glad the authors of the blog post are listening to the feedback they're getting here -- on the LSTM, on batch size comparisons, and about precision and being clear about which things they're measuring.

(Reminder disclosure: It's awkward talking about Google in the third person since they pay me part time, but I'm trying to approach this discussion with my academic hat on, too. This nested series of disclaimers is an amusing commentary on how small the machine learning + systems community is.)

Thank you for your feedback! (author here)

Our intention is really to provide a sound comparison. I think we agree that these kinds of comparisons can be hard given the constraints (e.g., lack of available technical information on TPUv2 or public implementations of optimized models for certain architectures). As I stated elsewhere, we are collecting all of the feedback and will run additional experiments.

If you know of an implementation of a mixed-precision/fp16 model that you'd like to see results for, please let us know! I may also reach out directly to you for that if you don't mind.

The number of devices is completely irrelevant.

It's all about performance per dollar.

Disclosure: I work on Google Cloud.

Not necessarily. The DGX-1, for example, has pretty poor perf/$$ but reduces the time a data scientist spends waiting. For some organizations, their people time is so valuable that what matters is “what gets me my answers back faster”, because that employee is easily $100/hr+.

That’s actually why the 8xV100 with NVLINK is so attractive (and why the TPUs also have board to board networking, not just chip to chip).
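The people-time argument can be sketched as simple arithmetic. The $100/hr figure comes from the comment above; the machine prices and run times are hypothetical, and the model assumes (as a deliberate simplification) that the person is blocked for the entire run. The function name `cost_to_answer` is invented for this example.

```python
def cost_to_answer(hours: float, machine_rate: float,
                   person_rate: float = 100.0) -> float:
    """Total cost of one run if a person waits for the whole duration.

    person_rate defaults to the $100/hr mentioned in the thread;
    everything else passed in below is a hypothetical figure.
    """
    return hours * (machine_rate + person_rate)


# Hypothetical: a cheap, slow box versus an expensive, fast one.
slow_hours, slow_rate = 10.0, 3.0    # 10 h at $3/hr
fast_hours, fast_rate = 2.0, 25.0    # 2 h at $25/hr

slow_machine_only = slow_hours * slow_rate
fast_machine_only = fast_hours * fast_rate
slow_total = cost_to_answer(slow_hours, slow_rate)
fast_total = cost_to_answer(fast_hours, fast_rate)

print(f"machine cost only: slow ${slow_machine_only:.0f}, fast ${fast_machine_only:.0f}")
print(f"including waiting: slow ${slow_total:.0f}, fast ${fast_total:.0f}")
```

Under these assumptions the slow box has the better perf/$ on hardware alone, but the fast box is far cheaper once waiting time is counted. The conclusion flips again if the person can do other useful work while the job runs.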

Sure, for a customer. But from a technological point of view, performance per dollar doesn't tell us everything. A company could subsidize their compute service and get astounding perf/dollar with a not particularly impressive chip.

I'd like to know perf/watt, for instance, even if it doesn't matter to the customer.

Agreed, it's almost purposefully misleading. He's not even using the same version of TensorFlow, or the current version of CUDA (9.1).

Would you expect a big performance difference from using CUDA 9.1?

Author here.

Point well taken, we'll make sure to add a comparison to 4 and 8 GPUs. For now, a "Cloud TPU" (containing 8 cores) seems to be the smallest unit to allocate. The question of what exactly makes up a single device and how many to compare against each other is not easy to answer.