| Wow! Thanks for taking the time on the long reply. "You are defining 'cost' far too narrowly" Agreed, but it is the opposite here, as Google has the advantage compared to Nvidia. So something does not add up? I mean, what you pointed out is a disadvantage for Nvidia? I mean, heck, Google controls the canonical AI framework. 100k stars on GitHub is just incredible, and I can only think of K8s doing anything similar. A big difference I think you are missing is that Google did NOT run their operation on Google Fiber. But they do on the TPUs. Google has over 4k production NNs, and running them at less than 1/2 the cost of using Nvidia saves them a huge amount of money. And that keeps growing. Google has a fundamental advantage over their competitors in having the TPUs. A perfect example is their new text to speech. Speech using a NN at 16k samples a second at a reasonable price would be impossible without the TPUs. https://cloudplatform.googleblog.com/2018/03/introducing-Clo... It does NOT appear Volta is within striking distance. But more importantly, Google will do a gen 3 and 4, etc. They have the data to iterate and Nvidia just does not. But more importantly Google does the entire stack and Nvidia does NOT. In AI it is so important to do the entire stack for efficiency reasons. Plus Google controls the canonical AI framework with TF. "Google's business strategy only allows it to spread development costs over its own deployment" Well, that clearly is untrue. Just take their text to speech sold as a service and the cost of doing on Nvidia would have been prohibitive. They could not even offer the service without the TPUs. "Do you work on the TPU team or something?" No. But I have been running into so much hate for Google from the alt-right over the firing of Damore that logic is lost. I was talking to a Russian this morning on Reddit and he was delusional because of his hate for Google, based on him thinking they are left-wing extremists.
"they have the resources to beat Google at its own game" This is the exact problem for Nvidia. They do not have the resources to compete. That is the exact problem. Chips will come from the big players and NOT third parties in the future. The entire dynamics of the industry have changed and are actually a lot more like the past, ironically. Google, Amazon, FB and other big players will do their own silicon. Even Tesla is supposed to do the same. The reason is that the people who buy the chips now run the chips, which was NOT true in the past. It used to be that a Dell purchased from Intel and sold the machine to someone. The big difference today is that the users of the systems are centralized with the big cloud providers. So they now get the data to improve the chips, which just was not true in the past. Plus it is looked at as being a competitive advantage. So Apple does their own. Google does their own, including the PVC on the device. Amazon and FB will also do their own. Google did the same thing years ago with networking. They quietly hired the Lanai team to build all their own network silicon, which significantly lowered their cost. Heck, Google then created their own network stack to make it deterministic. It is how it was possible to create Spanner. Tech companies are so much bigger today that they have the resources to do all their own stuff and own every layer of the stack instead of using third parties. Google could never be what they are today if they had not built their own stuff. Could you imagine the cost of using SAN instead of them creating GFS? |
Yes, Google has a large ML deployment, but so does Nvidia, which is not (currently) focused on direct-to-consumer public APIs but is actually doing deep learning and simulation at scale.
The hyperscaler approach to ML is not the only possible way to scale up; Nvidia chose to go the HPC/supercomputing route and basically built their own supercomputer from the ground up.
Both approaches have their advantages and drawbacks, but one thing that supercomputing approaches have is a focus on vertical scalability. It's not just about samples/second, but how big can you feasibly make and train an NN? Note that the national research labs are getting into the act, and those supercomputers are basically built in close collaboration with Nvidia [1].
I would really recommend spending some time on their website and watching some of their videos, e.g. [2]. Jensen Huang is completely bought into deep learning and NNs and has re-oriented his company towards making sure Nvidia can dominate the space.
> They have the data to iterate and Nvidia just does not
This is where I fundamentally disagree with you. This was true 3 years ago, but not today, mostly because Nvidia is the default option for ML researchers right now and they are slowly but steadily enticing everyone to collaborate with them (not to mention their self-driving efforts, which generate troves of data directly).
> Just take their text to speech sold as a service and the cost of doing on Nvidia would have been prohibitive.
That's on their own deployment.
Google is #3 in the cloud space right now. It's Nvidia-powered AWS + Azure ML deployments competing against Google, which also deploys V100s as well as TPUs.
Although it's possible for a single vertically integrated player to beat the rest of the market (e.g. Apple) for a long period of time, it's a difficult, risky proposition and it usually helps if they started out with a huge advantage, which Google doesn't seem to have since they're starting at #3 in the cloud space.
> They do not have the resources to compete. That is the exact problem.
I think, perhaps, you are still imagining the company as it was in 2012 or 2015, but its revenues and profits have grown substantially in the past few years.
Nvidia's market cap is $132bn and they have a profit run rate of about $4bn - 5bn / yr.
Their R&D spend has averaged about $2bn / yr for the last 5 years or so; in fact they beat AMD/ATI into the ground while spending less on R&D. They could basically triple the amount of money they pour into research if they wanted to.
By comparison, Google spends about $15bn/yr on R&D, but that's split across far more projects.
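To make the arithmetic behind the "triple" claim explicit, here is a rough sketch using only the approximate figures quoted above. It assumes, purely for illustration, that Nvidia redirected its entire profit run rate into R&D, which is an upper bound and not something any company would actually do. The function name `rd_multiple` is made up for this example.

```python
# Approximate figures from this comment, in billions of USD per year.
NVIDIA_RD = 2.0                  # average R&D spend over ~5 years
NVIDIA_PROFIT_RANGE = (4.0, 5.0) # quoted profit run rate, low and high
GOOGLE_RD = 15.0                 # total R&D, split across many projects


def rd_multiple(current_rd: float, extra: float) -> float:
    """Return the new R&D budget as a multiple of the current one.

    `extra` is the additional money put into R&D on top of `current_rd`.
    """
    return (current_rd + extra) / current_rd


# Hypothetical upper bound: all profit goes into research.
low = rd_multiple(NVIDIA_RD, NVIDIA_PROFIT_RANGE[0])
high = rd_multiple(NVIDIA_RD, NVIDIA_PROFIT_RANGE[1])

# How much larger Google's total R&D budget is than Nvidia's today.
google_vs_nvidia = GOOGLE_RD / NVIDIA_RD

print(f"Nvidia R&D could grow to {low:.1f}x-{high:.1f}x its current level")
print(f"Google's total R&D is {google_vs_nvidia:.1f}x Nvidia's")
```

On these numbers the ceiling is roughly 3x to 3.5x current spend, which would still be well under Google's total, though Google's total is spread over far more than ML hardware.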
> Google does the entire stack and Nvidia does NOT.
I'm going to have to strongly disagree with you on that one.
Google owns more of the deep learning end-to-end cloud stack, but they do not own more of the hardware, software, or firmware stack for accelerated computing.
Which 'ecosystem' wins, easier access to data (which Google does have) or control of the hardware + frameworks + partnerships, is an open question. I tend to believe the latter, because Nvidia has many options to get its hands on data (they can partner with the other cloud providers), while Google would have to invest quite a bit to compete on Nvidia's terms.
The easiest example, which I keep coming back to and which you haven't addressed, is how is Google going to compete on memory fabric and node architecture? Nvidia is out there building NVLink, NVSwitch, and basically their own supercomputing nodes (DGX-2).
They are working with ORNL to build some of the largest Volta deployments in the world, so they are rapidly building experience in doing deep learning at large scale as well. How would Google be able to match this if NN/DL development turns out to scale vertically (and we are seeing this in the rapid YoY growth of layer depth and network size in DL)?
Again, TF is not really a direct advantage for Google because it runs equally well on Nvidia hardware. If Google is so confident in the TPU winning out, why are they busy deploying Voltas in GCE?
If you want to do deep learning today, Nvidia is the go-to option because every deep learning framework runs on CUDA and cuDNN. If I want to use the TPU, I am stuck using GCE + TensorFlow (although Keras / PyTorch may soon have support), but with Nvidia I have the choice between every single cloud provider or my own local deployment, which is always ultimately cheaper than paying for cloud time. Google seems unlikely to sell you a TPU for your own DL box.
> Google, Amazon, FB and other big players will do their own silicon
It's certainly an interesting space now. MSFT is busy buying FPGAs from Xilinx and Intel/Altera as part of their strategy. Ultimately though, you seem to think that Nvidia is still a niche GPU maker from 2013 or so; it's not, it is larger than Tesla and certainly has more than enough funding, plus a very focused execution team and CEO.
> Google could never be what they are today if they had not built their own stuff. Could you imagine the cost of using SAN instead of them creating GFS?
I agree that the hyperscalers found significant savings by looking up the stack, but that has limits. They aren't building their own CPUs, for example. Chipmaking is a very, very expensive game.
[1] https://insidehpc.com/2018/01/using-titan-supercomputer-acce...
[2] https://www.youtube.com/watch?v=Rn73n1HYYNs