by jrk 3036 days ago
To be clear, I had read the whole post, I was just being terse since the emphasis seemed to be so heavily on an apples to bananas comparison (I believe 100% of the results cited in the prose, many in bold, are with mismatched precision and batch size), with minimal articulation of the many axes of nuance here. Precision isn't defined at all in the LSTM case, and could easily be the cause of the failure of the TPU run to converge where the GPU runs do. To a non-expert audience I think the end result is confusing and misleading.

Also, while I certainly agree that the performance/dollar comparison is highly relevant to customers at a given instant, that may only tell us that Google is subsidizing this hardware now that they've deployed it, and/or that, lacking serious competition, NVIDIA has been building crazy margins into their P100/V100 prices. In understanding fundamental technological tradeoffs, and even the limits of what the pricing in a more competitive market could be, it is relevant to compare performance per unit of hardware resources (mm^2, die/package, watt, GB of HBM, etc.)
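To make the distinction concrete, here is a minimal Python sketch of how the same throughput measurement can be normalized by price versus by hardware resources. The `Accelerator` class, the `normalized` helper, and every number below are made up for illustration; none of them are measurements or specs of any real TPU or GPU.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    """Hypothetical record of one benchmark run; all fields are placeholders."""
    name: str
    images_per_sec: float    # measured training throughput
    dollars_per_hour: float  # cloud rental price
    watts: float             # board power draw
    hbm_gb: float            # on-package memory capacity


def normalized(acc: Accelerator) -> dict:
    """Return the same throughput under several different denominators."""
    return {
        "images_per_dollar": acc.images_per_sec * 3600 / acc.dollars_per_hour,
        "images_per_sec_per_watt": acc.images_per_sec / acc.watts,
        "images_per_sec_per_gb_hbm": acc.images_per_sec / acc.hbm_gb,
    }


# Two invented devices: A is faster and more power-efficient, B is cheaper to rent.
a = Accelerator("A", images_per_sec=1000, dollars_per_hour=5.0, watts=250, hbm_gb=16)
b = Accelerator("B", images_per_sec=800, dollars_per_hour=2.0, watts=300, hbm_gb=16)

for acc in (a, b):
    print(acc.name, normalized(acc))
```

With these invented inputs, B wins on images per dollar while A wins on images per second per watt, which is the sense in which no single normalization tells the complete story.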

In short, these comparisons are hard, and there is no single one that tells a complete story. I pushed back because, while the post includes some nuance, it brushes a great deal under the rug and focuses primarily on problematic comparisons.

(Further disclosure: I'm at least the third person in this sub-thread with some Google Brain/Cloud affiliation. I am speaking in my independent academic voice. I also think TPUs are great, having them publicly available now is great, and competition and diversity of architectural approach in accelerators is great. I appreciate the effort of the authors, but think the subtlety of these comparisons requires serious care.)

2 comments

(1) I think we should all collectively agree to ignore the LSTM results -- without the model converging, it's impossible to say whether or not the bug that prevented convergence also affected speed. I can build an arbitrarily fast LSTM that doesn't converge in a few print statements. :-)

(2) Unless Google releases the specs to the h/w, I'd argue that cost is our best proxy. But if you assume that both Google and Amazon want to make a profit on their cloud rentals, it at least gives us a way to get to something we can normalize to (the V100's list price is public, though who knows how much Amazon pays). And, given that you can't buy a Cloud TPU, the price Google charges really is the meaningful answer. It doesn't tell us about fundamentals, but it's the right answer from a consumer standpoint.

I think it's a fair bigger-picture question to ask how we fairly and informatively benchmark cloud-only services in ways that not only give us consumer-oriented price comparisons but also let us learn from the underlying technical choices. The longer-term answer is that we beg Google to write a paper about TPUv2, as they (surprisingly!) did about TPUv1 -- because without that, we just get black box numbers combined with informed speculation based upon glossy board and heatsink photos.

btw - the best current source of specs about TPUv2 is Jeff's NIPS talk: http://learningsys.org/nips17/assets/slides/dean-nips17.pdf

It mentions a few details, such as 16GB of HBM per chip with 600GB/s memory bandwidth.

(3) I agree completely with you that the comparisons are hard. I'm very glad the authors of the blog post are listening to the feedback they're getting here -- on the LSTM, on batch size comparisons, and about precision and being clear about which things they're measuring.

(Reminder disclosure: It's awkward talking about Google in the third person since they pay me part time, but I'm trying to approach this discussion with my academic hat on as well. This nested series of disclaimers is an amusing commentary on how small the machine learning + systems community is.)

Thank you for your feedback! (author here)

Our intention is really to provide a sound comparison. I think we agree that these kinds of comparisons can be hard given the constraints (e.g., lack of available technical information on TPUv2 or public implementations of optimized models for certain architectures). As I stated elsewhere, we are collecting all of the feedback and will run additional experiments.

If you know of an implementation of a mixed-precision/fp16 model that you'd like to see results for, please let us know! I may also reach out directly to you for that if you don't mind.