
by smallnamespace 2976 days ago

> About 1/2 the cost for similar performance.

I would expect a dedicated accelerator to need at least a 5-10X advantage to outweigh all the other infrastructure and ecosystem costs.

GPUs are more useful for a wide variety of data-parallel tasks, and many more NN frameworks work on top of CUDA than work on the TPU.

In terms of horizontal scalability, nvidia has been rapidly iterating on increasing both memory and interlink bandwidth (including NVSwitch [1]), while each 'TPU' is actually 4 chips interconnected so likely has less upward scalability.

Also note that the tensor cores on a V100 take roughly 25-30% of the actual area. If Nvidia wanted to, they could probably easily make a pure tensor chip that beat the TPU in performance, could be produced in volume on their existing process, and also had full compatibility with their entire stack.

All in all, a 2x price/performance advantage for a hyper-specialized accelerator is basically a loss, just like how nobody installs a Soundblaster card anymore, how consumer desktops don't run discrete GPUs even though integrated graphics are a few times slower, or

[1] https://www.nextplatform.com/2018/04/04/inside-nvidias-nvswi...


hencoappel 2976 days ago

If that 2x price/performance scales for all of Google's inferencing then it is definitely not a loss for them. If they can halve their running costs for inferencing then they are saving themselves a ton of money. Their TPUv2 was announced slightly before the V100, and the money they save by not paying Nvidia premiums probably helps. From the customer point of view, what is a GPU other than a specialised accelerator? Without more details we can't know how a TPU really compares, but if your aim is to train or run inference on TensorFlow models, then they're a really competitive product at the moment.


smallnamespace 2975 days ago

I agree, but chip development is an expensive business. There is nothing preventing Nvidia from immediately turning around and building a specialised ML accelerator with better software integration and higher bandwidth. For all we know they could already be working on one.


jacksmith21006 2975 days ago

They already did two generations. Google has over $100B in the bank with less than $4B debt. So money is not an issue. It is tiny in the scheme of things.

Google has an advantage as they do the entire stack and can better optimize like we see here with half the cost.


smallnamespace 2975 days ago

Nvidia is actively building an entire deep learning stack internally, all the way to releasing a self-driving simulation platform which they are using to build their own self-driving software [1].

I think they are actually farther along and more aggressive about exploring deep learning use cases in production than Google today; augmenting real data with extensive simulation is really a far-reaching idea that comes directly from their gaming experience.

> So money is not an issue. It is tiny in the scheme of things.

Money of course is always an issue long term; otherwise why doesn't Google Fiber just spend tens of billions of dollars to build out its nationwide network? Because it will see negative ROI even if they succeed.

The TPU has to eventually make a real return to Google, and it won't if nvidia can spend the same amount of money and build a faster product and sell it to all the other cloud players, which I believe they definitely can.

Put another way, the TPU has to be cheaper to Google than buying nvidia GPUs after factoring in its development costs, whereas nvidia gets to amortize those dev costs over all other cloud providers and all other GPU customers. Google isn't about to sell the TPU to other cloud providers; the entire idea is to use it to drive Google Cloud adoption.
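As a rough sketch of that amortization argument: the in-house chip only pays off once the per-unit saving has covered the fixed development cost. Every number below is a made-up placeholder rather than a real Google or Nvidia figure, and `breakeven_units` is a hypothetical helper name.

```python
def breakeven_units(dev_cost, gpu_unit_cost, inhouse_unit_cost):
    """Number of chips at which building in-house costs the same as buying GPUs.

    dev_cost is the fixed R&D spend; the unit costs are per deployed chip.
    Returns infinity if the in-house chip is not cheaper per unit.
    """
    saving_per_unit = gpu_unit_cost - inhouse_unit_cost
    if saving_per_unit <= 0:
        return float("inf")
    return dev_cost / saving_per_unit


# Hypothetical inputs, chosen only to show the shape of the calculation.
units = breakeven_units(dev_cost=500e6, gpu_unit_cost=8000, inhouse_unit_cost=4000)
print(units)  # 125000.0
```

A vendor selling to many customers spreads the same `dev_cost` over a much larger unit count, which is the structural point being made here.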

The TPU is a fine chip, but if you just look at the big picture, there is every sign that nvidia could build the same or better product for less money because it has far more synergies across the hardware and chip design stack; e.g. the TPU only has PCIe connectors, while nvidia has already worked with IBM to get NVLink into supercomputers [2]. For some workloads the TPU will likely be bandwidth-starved communicating with the CPU and main memory.

[1] https://nvidianews.nvidia.com/news/nvidia-introduces-drive-c...

[2] https://www.ibm.com/us-en/marketplace/power-systems-ac922/de...


jacksmith21006 2975 days ago

The problem is Nvidia is never going to have the AI expertise up and down the stack like Google.

As far as I am aware Nvidia does not even run a cloud do they? Obviously never going to have the production NN that Google has.

Google now has well over 4k NNs in production, and I'm not sure Nvidia has any. Well over a billion a day are using the Google NNs. That data allows Google to iterate in ways that Nvidia just never would be able to.

But this was all theory, which is why it is valuable to start seeing more concrete results like this, where Google with their TPUs is able to charge 1/2 the price of using Nvidia. Then we also have the paper from Google on the Gen 1.

I would guess Google is working on a gen 3. Nvidia is trying to catch a moving target but without the data. So they are behind, trying to catch up, but missing an arm.

A perfect example of this phenomenon is the Capsule network pioneered by Hinton. It uses dynamic routing, which is potentially going to require a different approach to memory access, as the pattern would be different from a CNN or RNN.

Today the problem is memory access and no longer instruction execution. Google nailed the low hanging fruit with the Gen 1 TPUs. They have 65536 very simple cores. Now you have to go after memory access.
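To illustrate the compute-versus-memory point: 65536 is a 256 x 256 grid of multiply-accumulate units, and a roofline-style estimate shows how easily such an array ends up waiting on memory. The peak throughput and bandwidth figures below are illustrative assumptions, not measured TPU or GPU numbers, and the function names are hypothetical.

```python
# 65536 simple "cores" corresponds to a 256 x 256 grid of multiply-accumulate units.
MAC_ROWS, MAC_COLS = 256, 256
NUM_MACS = MAC_ROWS * MAC_COLS  # 65536


def attainable_ops_per_sec(peak_ops, mem_bandwidth, ops_per_byte):
    """Roofline estimate: throughput is capped by compute or by memory traffic."""
    return min(peak_ops, mem_bandwidth * ops_per_byte)


def is_memory_bound(peak_ops, mem_bandwidth, ops_per_byte):
    """True when memory bandwidth, not the MAC array, sets the limit."""
    return mem_bandwidth * ops_per_byte < peak_ops


# Hypothetical accelerator: numbers chosen only for illustration.
PEAK_OPS = 90_000_000_000_000   # 90 tera-ops/s (assumed)
MEM_BANDWIDTH = 30_000_000_000  # 30 GB/s (assumed)

for intensity in (10, 100, 1000, 10000):  # ops performed per byte fetched
    rate = attainable_ops_per_sec(PEAK_OPS, MEM_BANDWIDTH, intensity)
    print(intensity, rate, is_memory_bound(PEAK_OPS, MEM_BANDWIDTH, intensity))
```

With these assumed numbers a workload needs about 3000 operations per byte fetched before the MAC array, rather than memory, becomes the limit.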

Your post is all over the place so a bit hard to respond. Google Fiber was NOT about cost. It was about AT&T and other established players with some local governments making it difficult for Google to access what they needed to be able to compete.

I hate debating something with someone that is doing what you are doing. Google Fiber? Really?

"I think they are actually farther along and more aggressive about exploring deep learning"

I do a LOT of surfing on sites and can easily say this is the craziest thing I have read in a bit. You are honestly comparing Nvidia to Google? Really?

Google solved Go a decade early. Hinton did the Capsule networks and basically the father of DL. Well made it actually work. What breakthrough came from Nvidia?

A single one?

There is so much crazy stuff in your posts this must be driven by something else and something emotional? Your points are just not based on reality. Is this really about Google firing Damore?

BTW, Nvidia read the Google Gen 1 TPU paper, which is why we see them doing similar things. But Google is going to move on to addressing the memory access problems, as that is the next area to improve. Once Google figures it out, you will see Nvidia just copy the approach like they are doing with the gen 1 TPUs.

I listened to this Nvidia presentation on YouTube and they were basically quoting the Google TPU paper, talking about using 8-bit integers, etc., for inference.
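For readers unfamiliar with the 8-bit integer idea, here is a minimal sketch of symmetric quantization in plain Python. It is a generic textbook scheme, not the method from the TPU paper or from any Nvidia library, and `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    qmax = 127  # largest magnitude used in the symmetric int8 range
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    # Round to the nearest integer step and clamp into [-127, 127].
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]  # made-up weights for illustration
q, scale = quantize_int8(weights)
print(q)
print(dequantize(q, scale))
```

The integers are what the hardware multiplies and accumulates; the scale is applied once afterwards, which is why cheap integer units are enough for inference.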

Google will release the gen 3 and then share a paper on the gen 2 and we will see Nvidia then try to copy that one. Nvidia always a couple of steps behind.

But I am a super curious person and can you share what this is really all about?


smallnamespace 2974 days ago

Well, that's quite a lot to digest.

I'm not sure why you think I must be conspiratorial, although I will admit the thesis that 'Nvidia is an AI leader in software' is unusual, but ultimately I think well-supported by the public record and some diligent research.

I've been watching Nvidia for a while, and one thing you notice quickly is that, much like Apple, they don't pre-announce or oversell vaporware; they tend to only announce things that they have already worked on for years and that are imminently available.

> As far as I am aware Nvidia does not even run a cloud do they?

They don't run a public cloud yet, although they are making noises in that direction [1]. GPU Cloud right now is just a place where you get packaged Docker images (and then run them on AWS, GCE, what have you), but I don't think the branding is accidental—they are setting it up so if they decide to build a public cloud, ML researchers will already be familiar with the term.

They are also doing distributed cloud GPUs direct to consumer via Cloud Gaming [2].

Internally, they have gone the HPC/supercomputing route to develop their own ML stack, rather than the Google/MS/AWS hyperscaler route [3]. They basically built their own supercomputer based on Voltas, and they use it internally for everything, including developing self-driving car software [4] and the simulation platform.

Note that AFAIK, the simulation platform is far ahead of other players in the field. We have heard time and again that 'data' is going to be the competitive advantage to Tesla (miles driven) and Waymo (mapping data). What if you can partially sidestep the issue by leveraging the ability of humans to actually define dangerous scenarios and rigorously test them outside of the constraints of road driving?

The platform has literally taken the idea of 'regression testing' and translated it into the ML space, and they are planning to deploy this into production systems in the next 1-2 years. From what I've heard from ML researchers, the end-to-end testing and deployment of NNs is still rather in its infancy, in terms of being able to change your network and then do mass inferencing on prior 'test cases' that you think are important.
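A minimal sketch of what regression testing for a network could look like: keep a fixed set of scenarios and re-run every new model version against them. The scenarios, thresholds and model functions are invented stand-ins and have nothing to do with Nvidia's actual platform.

```python
def run_regression_suite(model, cases):
    """Run every stored scenario through the model and collect the failures."""
    failures = []
    for name, inputs, expected in cases:
        if model(inputs) != expected:
            failures.append(name)
    return failures


# Hypothetical scenarios: (name, simplified sensor summary, expected action).
CASES = [
    ("pedestrian_crossing", {"obstacle_m": 5, "speed_kph": 30}, "brake"),
    ("clear_highway", {"obstacle_m": 200, "speed_kph": 100}, "cruise"),
    ("stopped_truck", {"obstacle_m": 40, "speed_kph": 100}, "brake"),
]


def model_v1(inputs):
    # Stand-in for a trained network: brake if an obstacle is within 50 m.
    return "brake" if inputs["obstacle_m"] < 50 else "cruise"


def model_v2(inputs):
    # A "retrained" stand-in with a tighter threshold that regresses one scenario.
    return "brake" if inputs["obstacle_m"] < 20 else "cruise"


print(run_regression_suite(model_v1, CASES))  # []
print(run_regression_suite(model_v2, CASES))  # ['stopped_truck']
```

In a simulation setting, each case would be a full simulated drive rather than a dictionary, but the workflow of "change the network, replay the known-important cases" is the same.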

> Google Fiber was NOT about cost. It was about AT&T and other established players with some local governments making it difficult for Google to access what they needed to be able to compete.

You are defining 'cost' far too narrowly, or rather not seeing how non-economic costs eventually translate into economic ones. The established players made it difficult for Google. This eventually translated into 1) higher legal fees to fight them, 2) slower deployment rates, and 3) higher operational costs for expansion. All these things obviously cost lots of time and money and sharply lower the overall ROI of a project, which is why Google has essentially given up. There's only risk, no reward.

The point is not to compare the TPU project directly to Fiber (the two projects are very different), but just to address your point that 'cost doesn't matter to Google because they have a lot of money'. Companies that truly don't care about cost will very soon end up with very little money. Put another way, I don't think the eventual reward from continuing TPU development will be more profitable than simply buying GPUs from Nvidia down the line.

> Now you have to go after memory access.

Nvidia might be better-positioned to optimize memory access than Google is: they have their own fabric and work with a large variety of partners to optimize their ML/DL workloads.

> Your post is all over the place so a bit hard to respond.

Well, the crux of my argument is that:

1. Chip development is an expensive business

2. Nvidia is good at building chips; the Volta is already within striking distance of the TPU using only ~25% of its die area for tensor units. As NNs grow, inter-node scalability will become more important, and Nvidia has large advantages in interconnect that will show up in large-scale deployments (like supercomputers, where I expect a lot of DL to happen)

3. Google's business strategy only allows it to spread development costs over its own deployment, while Nvidia lets many other players pay for the dev cost, including competing hyperscalers, HPC, gamers, and carmakers. Nvidia's potential 'ecosystem' is much larger than Google's. Historically, we've seen that structural advantage be very hard to surmount.

1-3 means that in the long run, a 'go-it-alone' strategy like Google's is unlikely to win a protracted R&D fight.

> Google solved Go a decade early. Hinton did the Capsule networks and basically the father of DL. Well made it actually work. What breakthrough came from Nvidia?

Yes, DeepMind has made some great strides, but how does that directly fund TPU development and give it a competitive advantage? The fact that those papers are published means that any talented researcher at Nvidia can replicate the work, then run and optimize it on their GPU architecture.

> There is so much crazy stuff in your posts this must be driven by something else and something emotional? Your points are just not based on reality. Is this really about Google firing Damore?

I'm not sure why you are so convinced that only a crazy person with a beef about Google can have a differing opinion from you. Do you work on the TPU team or something?

> Google will release the gen 3 and then share a paper on the gen 2 and we will see Nvidia then try to copy that one. Nvidia always a couple of steps behind.

Where's your evidence that Nvidia is simply copying Google, rather than both engineering teams viewing the same problems and converging to similar solutions?

Note that even if it is true that Nvidia is simply 'copying Google', they have the resources to beat Google at its own game, by leveraging process, memory, CUDA, etc. You've studiously avoided addressing this point.

[1] https://www.nvidia.com/en-us/gpu-cloud/deep-learning-contain...

[2] http://www.nvidia.com/object/cloud-gaming.html

[3] https://www.nextplatform.com/2017/11/30/inside-nvidias-next-...

[4] https://www.youtube.com/watch?v=booEg6iGNyo
