| HN Mirror


by smallnamespace 2977 days ago

Nvidia is actively building an entire deep learning stack internally, all the way to releasing a self-driving simulation platform which they are using to build their own self-driving software [1].

I think they are actually farther along and more aggressive about exploring deep learning use cases in production than Google today; augmenting real data with extensive simulation is really a far-reaching idea that comes directly from their gaming experience.

> So money is not an issue. It is tiny in the scheme of things.

Money of course is always an issue long term; otherwise why doesn't Google Fiber just spend tens of billions of dollars to build out its nationwide network? Because it will see negative ROI even if they succeed.

The TPU has to eventually make a real return to Google, and it won't if nvidia can spend the same amount of money and build a faster product and sell it to all the other cloud players, which I believe they definitely can.

Put another way, the TPU has to be cheaper to Google than buying nvidia GPUs after factoring in its development costs, whereas nvidia gets to amortize those dev costs over all other cloud providers and all other GPU customers. Google isn't about to sell the TPU to other cloud providers; the entire idea is to use it to drive Google Cloud adoption.
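To make the amortization point concrete, here is a rough sketch. Every figure below is a made-up placeholder rather than a real number for either company, and `per_chip_cost` is a hypothetical helper introduced only for illustration.

```python
def per_chip_cost(dev_cost, unit_cost, units):
    """Effective cost per chip once development cost is spread over all units."""
    return unit_cost + dev_cost / units

# Hypothetical numbers, purely illustrative.
DEV_COST = 500_000_000   # assume the same development budget for both designs
UNIT_COST = 2_000        # assume the same manufacturing cost per chip

# An in-house chip is spread over one company's deployment only.
in_house = per_chip_cost(DEV_COST, UNIT_COST, units=100_000)

# A merchant chip is spread over every customer that buys it.
merchant = per_chip_cost(DEV_COST, UNIT_COST, units=1_000_000)

print(in_house, merchant)
```

With identical development and manufacturing costs, the only thing that differs is the number of units the fixed cost is divided by, which is the whole argument.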

The TPU is a fine chip, but if you just look at the big picture, there is every sign that nvidia could build the same or better product for less money because it has far more synergies across the hardware and chip design stack; e.g. the TPU only has PCIe connectors, while nvidia has already worked with IBM to get NVLink into supercomputers [2]. For some workloads the TPU will likely be bandwidth-starved communicating with the CPU and main memory.

[1] https://nvidianews.nvidia.com/news/nvidia-introduces-drive-c...

[2] https://www.ibm.com/us-en/marketplace/power-systems-ac922/de...


jacksmith21006 2977 days ago

The problem is Nvidia is never going to have the AI expertise up and down the stack like Google.

As far as I am aware Nvidia does not even run a cloud do they? Obviously never going to have the production NN that Google has.

Google now has well over 4k NNs in production, and I am not sure Nvidia has any. Well over a billion people a day are using the Google NNs. That data allows Google to iterate in ways that Nvidia just never would be able to.

But this was all theory, which is why it is valuable to start seeing more concrete results like this, where Google with their TPUs is able to charge 1/2 the price of using Nvidia. Then we also have the paper from Google on the Gen 1.

I would guess Google is working on a gen 3. Nvidia is trying to catch a moving target but without the data. So they are behind, trying to catch up, but missing an arm.

A perfect example of this phenomenon is the Capsule network pioneered by Hinton. Capsule networks use dynamic routing, which is potentially going to require a different approach to memory access, as the pattern would be different than for a CNN or RNN.

Today the problem is memory access and no longer instruction execution. Google nailed the low hanging fruit with the Gen 1 TPUs. They have 65536 very simple cores. Now you have to go after memory access.

Your post is all over the place so a bit hard to respond. Google Fiber was NOT about cost. It was about AT&T and other established players with some local governments making it difficult for Google to access what they needed to be able to compete.

I hate debating something with someone that is doing what you are doing. Google Fiber? Really?

"I think they are actually farther along and more aggressive about exploring deep learning"

I do a LOT of surfing on sites and can easily say this is the craziest thing I have read in a bit. You are honestly comparing Nvidia to Google? Really?

Google solved Go a decade early. Hinton did the Capsule networks and is basically the father of DL. Well, he made it actually work. What breakthrough came from Nvidia?

A single one?

There is so much crazy stuff in your posts this must be driven by something else and something emotional? Your points are just not based on reality. Is this really about Google firing Damore?

BTW, Nvidia read the Google Gen 1 TPU paper, which is why we see them doing similar things. But Google is going to move to addressing the memory access problems as that is the next area to improve. Once Google figures it out then you will see Nvidia just copy the approach like they are doing with the gen 1 TPUs.

I listened to this Nvidia presentation on YouTube and they were basically quoting the Google TPU paper. Talking about using 8 bit, integers, etc, for inference.
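For anyone unfamiliar with the 8-bit integer idea, a minimal sketch of it follows. This is not how the TPU or any Nvidia part implements it; the `quantize` and `int8_dot` helpers and the scale factors are assumptions chosen for illustration.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers using a single scale factor."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_dot(weights, activations, w_scale, a_scale):
    """Dot product done in integer arithmetic, rescaled to float at the end."""
    qw = quantize(weights, w_scale)
    qa = quantize(activations, a_scale)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer multiply-accumulate
    return acc * w_scale * a_scale

# Toy values; the scales are picked so each list roughly fills the int8 range.
weights = [0.5, -0.25, 0.75]
activations = [1.0, 2.0, -1.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(weights, activations, w_scale=1 / 127, a_scale=2 / 127)

print(exact, approx)
```

The integer result lands close to the floating-point one, which is why reduced precision is attractive for inference: the multiply-accumulate units can be much smaller and cheaper.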

Google will release the gen 3 and then share a paper on the gen 2 and we will see Nvidia then try to copy that one. Nvidia is always a couple of steps behind.

But I am a super curious person and can you share what this is really all about?


smallnamespace 2977 days ago

Well, that's quite a lot to digest.

I'm not sure why you think I must be conspiratorial, although I will admit the thesis that 'Nvidia is an AI leader in software' is unusual, but ultimately I think well-supported by the public record and some diligent research.

I've been watching Nvidia for a while, and one thing you notice quickly is that, much like Apple, they don't pre-announce or oversell vaporware; they tend to only announce things that they have already worked on for years and that are imminently available.

> As far as I am aware Nvidia does not even run a cloud do they?

They don't run a public cloud yet, although they are making noises in that direction [1]. GPU Cloud right now is just a place where you get packaged Docker images (and then run them on AWS, GCE, what have you), but I don't think the branding is accidental—they are setting it up so if they decide to build a public cloud, ML researchers will already be familiar with the term.

They are also doing distributed cloud GPUs direct to consumer via Cloud Gaming [2].

Internally, they have gone the HPC/supercomputing route to develop their own ML stack, rather than the Google/MS/AWS hyperscaler route [3]. They basically built their own supercomputer based on Voltas, and they use it internally for everything, including developing self-driving car software [4] and the simulation platform.

Note that AFAIK, the simulation platform is far ahead of other players in the field. We have heard time and again that 'data' is going to be the competitive advantage to Tesla (miles driven) and Waymo (mapping data). What if you can partially sidestep the issue by leveraging the ability of humans to actually define dangerous scenarios and rigorously test them outside of the constraints of road driving?

The platform has literally taken the idea of 'regression testing' and translated it into the ML space, and they are planning to deploy this into production systems in the next 1-2 years. From what I've heard from ML researchers, the end-to-end testing and deployment of NNs is still rather in its infancy, in terms of being able to change your network and then do mass inferencing on prior 'test cases' that you think are important.
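A rough sketch of what 'regression testing' a network could look like, under heavy assumptions: the two `model_*` functions are toy stand-ins for network versions, and the scenario names and numbers are hypothetical, not anything from Nvidia's platform.

```python
def run_regression(model, cases, tolerance=1e-6):
    """Re-run stored scenarios through a model and report any that changed."""
    failures = []
    for name, inputs, expected in cases:
        actual = model(inputs)
        if abs(actual - expected) > tolerance:
            failures.append((name, expected, actual))
    return failures

# Hypothetical stand-ins for two versions of a network.
def model_v1(x):
    return 2.0 * x[0] + 1.0 * x[1]

def model_v2(x):
    return 2.0 * x[0] + 1.5 * x[1]  # a retrained weight changes behaviour

# Scenarios recorded together with the outputs model_v1 produced for them.
cases = [
    ("pedestrian_crossing", (1.0, 0.0), 2.0),
    ("sudden_braking", (0.0, 2.0), 2.0),
]

failures_v1 = run_regression(model_v1, cases)
failures_v2 = run_regression(model_v2, cases)
print(failures_v1, failures_v2)
```

The point is the workflow rather than the toy math: change the network, replay every scenario you care about, and flag the ones whose output moved.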

> Google Fiber was NOT about cost. It was about AT&T and other established players with some local governments making it difficult for Google to access what they needed to be able to compete.

You are defining 'cost' far too narrowly, or rather not seeing how non-economic costs eventually translate into economic ones. The established players made it difficult for Google. This eventually translated into 1) higher legal fees to fight them 2) slower deployment rates and 3) higher operational costs for expansion. All these things obviously cost lots of time and money and sharply lower the overall ROI of a project, hence why Google has essentially given up. There's only risk, no reward.

The point is not to compare the TPU project directly to Fiber (the two projects are very different), but just to address your point that 'cost doesn't matter to Google because they have a lot of money'. Companies that truly don't care about cost will very soon end up with very little money. Put another way, I don't think the eventual reward from continuing TPU development will be more profitable than simply buying GPUs from Nvidia down the line.

> Now you have to go after memory access.

Nvidia might be better-positioned to optimize memory access than Google is: they have their own fabric and work with a large variety of partners to optimize their ML/DL workloads.

> Your post is all over the place so a bit hard to respond.

Well, the crux of my argument is that:

1. Chip development is an expensive business

2. Nvidia is good at building chips; the Volta is already within striking distance of the TPU using only ~25% of its die area for tensor units. As NNs grow, inter-node scalability will become more important, and Nvidia has large advantages in interconnect that will show up in large-scale deployments (like supercomputers, where I expect a lot of DL to happen)

3. Google's business strategy only allows it to spread development costs over its own deployment, while Nvidia lets many other players pay for the dev cost, including competing hyperscalers, HPC, gamers, and carmakers. Nvidia's potential 'ecosystem' is much larger than Google's. Historically, we've seen that structural advantage be very hard to surmount.

1-3 means that in the long run, a 'go-it-alone' strategy like Google's is unlikely to win a protracted R&D fight.

> Google solved Go a decade early. Hinton did the Capsule networks and is basically the father of DL. Well, he made it actually work. What breakthrough came from Nvidia?

Yes, Deepmind has made some great strides, but how does that directly fund TPU development and give it a competitive advantage? The fact that those papers are published means that any talented researcher at Nvidia can replicate the work, then run and optimize it on their GPU architecture.

> There is so much crazy stuff in your posts this must be driven by something else and something emotional? Your points are just not based on reality. Is this really about Google firing Damore?

I'm not sure why you are so convinced that only a crazy person with a beef about Google can have a differing opinion from you. Do you work on the TPU team or something?

> Google will release the gen 3 and then share a paper on the gen 2 and we will see Nvidia then try to copy that one. Nvidia is always a couple of steps behind.

Where's your evidence that Nvidia is simply copying Google, rather than both engineering teams viewing the same problems and converging to similar solutions?

Note that even if it is true that Nvidia is simply 'copying Google', they have the resources to beat Google at its own game, by leveraging process, memory, CUDA, etc. You've studiously avoided addressing this point.

[1] https://www.nvidia.com/en-us/gpu-cloud/deep-learning-contain...

[2] http://www.nvidia.com/object/cloud-gaming.html

[3] https://www.nextplatform.com/2017/11/30/inside-nvidias-next-...

[4] https://www.youtube.com/watch?v=booEg6iGNyo


jacksmith21006 2976 days ago

Wow! Thanks for taking the time on the long reply.

"You are defining 'cost' far too narrowly "

Agree, but it is the opposite here, as Google has the advantage compared to Nvidia. So something does not add up? I mean, what you pointed out is a disadvantage for Nvidia? I mean, heck, Google controls the canonical AI framework. 100k stars on Github is just incredible and I can only think of K8s doing anything similar?

A big difference I think you are missing is Google did NOT run their operation on Google Fiber. But they do on the TPUs. Google has over 4k production NN and the amount of money they save running them at less than 1/2 the cost for their own stuff versus using Nvidia is a huge amount of money.

But it also keeps growing. Google has a fundamental advantage over their competitors in having the TPUs. A perfect example is their new text to speech.

Speech using a NN at 16k samples a second at a reasonable price would be impossible without the TPUs.

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

Does NOT appear Volta is in striking distance. But more importantly Google will do a gen 3 and 4, etc. They have the data to iterate and Nvidia just does not.

But more importantly Google does the entire stack and Nvidia does NOT. In AI it is so important to do the entire stack for efficiency reasons. Plus Google controls the canonical AI framework with TF.

"Google's business strategy only allows it to spread development costs over its own deployment "

Well that clearly is untrue. Just take their text to speech sold as a service and the cost of doing on Nvidia would have been prohibitive. They could not even offer the service without the TPUs.

"Do you work on the TPU team or something? "

No. But I have been running into all this hate for Google with the alt right all around firing Damore that logic is lost.

I was talking to a Russian this morning on Reddit and he was delusional because of his hate for Google, based on him thinking they are left-wing extremists.

"they have the resources to beat Google it its own game "

This is the exact problem for Nvidia. They do not have the resources to compete. That is the exact problem.

Chips will come from the big players and NOT third parties in the future.

The entire dynamics of the industry have changed and actually a lot more like the past ironically.

Google, Amazon, FB and other big players will do their own silicon. Even Tesla is supposed to do the same.

The reason is that the people that buy the chips now run the chips, which was NOT true in the past. It used to be that a Dell purchased from Intel and sold the machine to someone.

The big difference today is the users of the systems are centralized with the big cloud providers. So they now get the data to improve the chips which just was not true in the past.

Plus it is looked at as being a competitive advantage.

So Apple does their own. Google does their own including the PVC on the device. Amazon and FB will also do their own.

Google did the same thing years ago with networking. They quietly hired the Lanai team to build all their own network silicon, which significantly lowered their cost.

Heck Google then created their own network stack to make it determinate. It is how it was possible to create Spanner.

Tech companies are so much bigger today they have the resources to do all their own stuff and own every layer of the stack instead of using third parties.

Google could never be what they are today if they had not built their own stuff. Could you imagine the cost of using SAN instead of them creating GFS?


smallnamespace 2976 days ago

> Google has over 4k production NN and the amount of money they save running them at less than 1/2 the cost for their own stuff versus using Nvidia is a huge amount of money.

Yes, Google has a large ML deployment, but so does Nvidia, which is not (currently) focused on direct-to-consumer public APIs, but actually doing deep learning and simulation at scale.

The hyperscaler approach to ML is not the only possible way to scale up; Nvidia chose to go the HPC/supercomputing route and basically built their own supercomputer from the ground up.

Both approaches have their advantages and drawbacks, but one thing that supercomputing approaches have is a focus on vertical scalability. It's not just about samples/second, but how big can you feasibly make and train an NN? Note that the national research labs are getting into the act, and those supercomputers are basically built in close collaboration with Nvidia [1].

I would really recommend spending some time on their website and watching some of their videos, e.g. [2]. Jensen Huang is completely bought into deep learning and NN and has re-oriented his company towards making sure Nvidia can dominate the space.

> They have the data to iterate and Nvidia just does not

This is where I fundamentally disagree with you. This was true 3 years ago, but not today, mostly because Nvidia is the default option for ML researchers right now and they are slowly but steadily enticing everyone to collaborate with them (not to mention their self-driving efforts, which generate troves of data directly).

> Just take their text to speech sold as a service and the cost of doing on Nvidia would have been prohibitive.

That's on their own deployment.

Google is #3 in the cloud space right now. It's Nvidia-powered AWS + Azure ML deployments competing against Google, which also deploys V100s as well as TPUs.

Although it's possible for a single vertically integrated player to beat the rest of the market (e.g. Apple) for a long period of time, it's a difficult, risky proposition and it usually helps if they started out with a huge advantage, which Google doesn't seem to have since they're starting at #3 in the cloud space.

> They do not have the resources to compete. That is the exact problem.

I think, perhaps, you are still imagining the company as it was in 2012 or 2015, but the company's revenues and profits have grown substantially in the past years.

Nvidia's market cap is $132bn and they have a profit run rate of about $4bn - 5bn / yr.

Their R&D spend has averaged about $2bn / yr for the last 5 years or so; in fact they beat AMD/ATI into the ground while spending less on R&D. They could basically triple the amount of money they pour into research if they wanted to.

By comparison, Google spends about $15bn/yr on R&D, but that's split across far more projects.
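As a back-of-the-envelope check on those figures, using only the numbers quoted above and assuming (purely for illustration, not as a forecast) that all profit could be redirected into R&D:

```python
# Figures as quoted in this comment, in USD billions per year.
nvidia_rnd = 2.0
nvidia_profit_low, nvidia_profit_high = 4.0, 5.0
google_rnd = 15.0

# Assumption: every dollar of profit is redirected into R&D.
max_rnd_low = nvidia_rnd + nvidia_profit_low
max_rnd_high = nvidia_rnd + nvidia_profit_high

multiple_low = max_rnd_low / nvidia_rnd      # how many times current R&D
multiple_high = max_rnd_high / nvidia_rnd
google_to_nvidia = google_rnd / nvidia_rnd   # Google's total R&D vs Nvidia's

print(multiple_low, multiple_high, google_to_nvidia)
```

So 'triple' is roughly the lower bound of what the profit run rate would cover, while Google's total R&D budget is several times larger but spread over far more projects.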

> Google does the entire stack and Nvidia does NOT.

I'm going to have to strongly disagree with you on that one.

Google owns more of the deep learning end-to-end cloud stack, but they do not own more of the hardware, software, or firmware stack for accelerated computing.

Which 'ecosystem' wins, the easier access to data (which Google does have) vs. controlling the hardware + frameworks + partnerships, is an open question. I tend to believe the latter, because Nvidia has many options to get its hands on data (they can partner with the other cloud providers), while Google would have to invest quite a bit to compete on Nvidia's terms.

The easiest example, which I keep coming back to and which you haven't addressed, is how is Google going to compete on memory fabric and node architecture? Nvidia is out there building NVLink, NVSwitch, and basically their own supercomputing nodes (DGX-2).

They are working with ORNL to build some of the largest Volta deployments in the world, so they are rapidly building experience on doing deep learning at large scale as well. How would Google be able to match this if NN/DL development turns out to scale vertically (and we are seeing this in rapid YoY growth of layer depth and network size in DL)?

Again, TF is not really a direct advantage for Google because it runs equally well on Nvidia hardware. If Google is so confident in the TPU winning out, why are they busy deploying Voltas in GCE?

If you want to do deep learning today, Nvidia is the go-to option because every deep learning framework is on CUDA, including cuDNN. If I want to use the TPU, I am stuck using GCE + Tensorflow (although Keras / PyTorch may soon have support), but with Nvidia I have the choice between every single cloud provider or my own local deployment, which is always ultimately cheaper than paying for cloud time. Google seems unlikely to sell you a TPU for your own DL box.

> Google, Amazon, FB and other big players will do their own silicon

It's certainly an interesting space now. MSFT is busy buying FPGAs from Xilinx and Intel/Altera as part of their strategy. Ultimately though, you seem to think that Nvidia is still a niche GPU maker from 2013 or so; it's not, it is larger than Tesla and certainly has more than enough funding, plus a very focused execution team and CEO.

> Google could never be what they are today if they had not built their own stuff. Could you imagine the cost of using SAN instead of them creating GFS?

I agree that the hyperscalers found significant savings by looking up the stack, but that has limits. They aren't building their own CPUs, for example. Chipmaking is a very, very expensive game.

[1] https://insidehpc.com/2018/01/using-titan-supercomputer-acce...

[2] https://www.youtube.com/watch?v=Rn73n1HYYNs


jacksmith21006 2975 days ago

Not aware of Nvidia having anywhere near the number of neural networks in production or anywhere near the number of users.

Not even sure where they are hosting them or even what they do? How about some color as you have me curious?

Have watched videos of Jensen but also watched an excellent almost 2 hour presentation from one of their VPs. He said a lot of things that were in the Google TPU paper which I found a bit funny. How you can use 8 bits and integers for inference for example. Said to me these guys are trying to catch up.

The problem is Amazon has that data NOT Nvidia. It is not in Amazon best interest to help Nvidia this is my exact point. The entire dynamics of the chip business have changed. You will see Amazon do their own just like Google has.

Once Google did the gen 1 TPUs they set the direction: you just can NOT buy off the shelf and compete long term.

The silicon is strategic for AI.

MS went the wrong direction in using a FPGA solution in addition to using Nvidia. But once again no data for Nvidia.

Market cap does not give you the money. But Google in 2018 will spend about 2x Nvidia 2017 sales! Yes you read that correct. Google on R&D will spend 2x Nvidia 2017 sales!

Google profits will be over 4x Nvidia total 2017 sales.

Once again Nvidia does NOT do the entire stack. I am not aware of any algorithm breakthroughs that came from Nvidia. I can not even name one AI expert at Nvidia.

But the score board is papers accepted at NIPS. Nvidia did NOT get a single paper accepted that I saw at the conference?

Versus Google had more than anyone. 9% of all the papers accepted came from Google.

https://medium.com/machine-learning-in-practice/nips-accepte...

If Nvidia is playing in the entire stack how could they NOT get a single paper accepted at NIPS?

Or did I miss it?

If we look at Self Driving cars one of the most important AI applications Nvidia does not even show up on patents? Once again Google ahead by a mile.

https://www.theatlas.com/charts/r1iEkmKkz

Something in your post does NOT add up? Why if Nvidia is a player in the stack besides the silicon why do they NOT show up any where?

Google deploys both, TPUs and Nvidia, for a number of reasons I suspect.

The biggest is they want TF to be the canonical framework for AI and they MUST show not favoring their own solution until it is a done deal which is getting close.

In the end TF will never run as well on Nvidia as it will on the TPUs. We can see it here with about 1/2 the cost using the TPUs over Nvidia.

It is like saying Android would run as well as iOS on the Apple processors. It is all about controlling the entire stack like Apple has done and Nvidia is just not in a position to be able to.

Makes no sense to buy the processors so would not make any sense for Google to sell them to others. Not going to ever see that happen.

But I do think it is possible Google will sell the PVCs.

The ultimate problem is Nvidia is perpetually catching up. Right now the big new thing that came from Hinton is Capsule networks and using dynamic routing. Google will have that optimized in silicon long before Nvidia will.

I suspect it will create the need for a different approach how you access memory in chip architecture.

But Capsule networks are heavy computationally and so silicon will matter a lot. Google has the algorithms and how they want to use in production at scale and then the money to execute in supporting in silicon. They just move way too fast for Nvidia to ever be able to catch up.


smallnamespace 2975 days ago

I'd really like to hear what you think about Nvidia's approach to self-driving, especially using supercomputing + simulation + backtesting to bootstrap the process. We keep going back and forth on this topic, but how can you develop a self-driving platform without a bunch of NNs in production, running on the Nvidia Saturn V supercomputer?

> How you can use 8 bits and integers for inference for example. Said to me these guys are trying to catch up.

I think it's interesting that you presume that only Google came up with the idea first, rather than 'reducing precision' to be a rather obvious idea that any chip designer or ML practitioner would have brought up. Again, can you please justify that?

I think where we're at a disconnect is that you equate AI leadership with publishing and patents, whereas Nvidia is an extremely secretive organization that would probably avoid publishing what they see as a competitive advantage. This is similar to how Apple operates.

I used to work in finance, and the culture was the same way: banks had state-of-the-art models internally but would never share them. Published papers in academia were probably ~5 years behind what the banks had.

I do believe that Google (mostly Deepmind) is the leader in the research field, but note that they had to go out and buy that expertise.

> Google on R&D will spend 2x Nvidia 2017 sales!

Yes, but it's not all going into AI for sure, and definitely not into bankrolling the TPU effort. We should compare apples to apples here, surely?

> The entire dynamics of the chip business have changed. You will see Amazon do their own just like Google has.

So what about Nvidia's self-driving efforts? I've talked about it for about 3-4 posts now, with references to presentations and videos, and heard more or less crickets from you about it. I don't see how you can repeatedly say that Nvidia has no access to data when they clearly have a working product (Drive PX2) already, plus more (Drive Xavier) ready to be deployed in cars within the next ~18 months.

> Google deploys both, TPUs and Nvidia, for a number of reasons I suspect.

> The biggest is they want TF to be the canonical framework for AI and they MUST show not favoring their own solution until it is a done deal which is getting close.

Yes, but for those exact same reasons, the TPU will not be a strategic edge for Google, which lowers the ROI of working on the project.

You can't have it both ways: either the TPU is the secret sauce that drives Google Cloud adoption and gives them a big leg up in AI (in which case, they would want to leverage TF and make it 'run better' on the TPU than on other hardware), or else TF is a neutral platform and it doesn't benefit either party (which I actually agree with).

> It is like saying Android would run as well as iOS on the Apple processors. It is all about controlling the entire stack like Apple has done and Nvidia is just not in a position to be able to.

I think the analogy here is really apt, but also shows why I don't believe in Google's success here long-term.

The iPhone basically invented the smartphone market; its product was 10x better than any other competitor when it was introduced, and it was probably the majority of volume (and definitely profit) for years before Android was able to compete.

The TPU is not head and shoulders above the competition. The Volta came out literally ~1 year after Pascal and had 10X the tensor throughput; you say that Google isn't standing still, but certainly neither will Nvidia.

Basically, Google is not starting from a 'commanding lead' position like Apple did. And we see today that even though Apple still leads in profits, Samsung is very close, and Android is the vast majority of the market.

Larger ecosystems tend to beat fully vertical stacks in the long term. We see this across many markets and products. So why do you think this will be the exception?
