by bhouston 3036 days ago
It is hard for Google to make money on these TPUs, as the whole engineering cost has to be made back from its pricing on Google Cloud, whereas NVIDIA can pay back its engineering costs via multiple mature channels (games, supercomputers, and multiple cloud providers).

I wonder which is higher: the cost of creating the TPUs in terms of engineering and manufacturing, or the cost differential in terms of usage as compared to NVIDIA's latest?

I worry about Google long term here. I am surprised the TPU doesn't kick the ass of the NVIDIA chips.

9 comments

Disclosure: I work on Google Cloud.

By the logic above, you would conclude that TPUv1 (the inference-only chip) might have been a mistake, but we’ve been very public about how it “saved us from building lots of datacenters”.

That wasn’t ever sold as part of Cloud, so the benefit there is all from the second bit you mentioned: cheaper and more efficient than GPUs at the time. The paper also goes into more detail, but the size of that initial engineering team and time to market were both quite small.

For training, before Volta (and kind of Pascal), GPUs were the best option but not particularly efficient. Volta does the same “we should have a single instruction that does lots of math in one shot” by cleverly reusing the existing functional units. That the V100 is a great chip is a good outcome for the whole industry. But GPUs aren’t (and shouldn’t be) just focused on ML. My bet is that there’s still a decent amount of runway left in specialized chips for ML, just as GPUs carved out their own niche versus CPUs.

But again, the “even just for Google” benefit is really enormous so I wouldn’t assume that Cloud has to pay for the entire effort. Could GPU manufacturers improve the cost:performance ratio of ML workloads enough that Google doesn’t have to build TPUs anymore? Perhaps, but like the V100 improvements that would be a great outcome!

Is there going to be an updated paper on performance per Watt, now that TPUv2 is public and V100 has been preannounced on the Google blog?
There's no real need to worry about Google in the long term - nVidia can make back their money solely with their GPUs; Google probably made their expenses back this weekend with searches around the Olympics. It'd be pointless for them to not use their TPUs themselves, and their main product, Adsense, uses ML.
Are you sure about Adsense? Talked to ad pros recently and they all complained Adsense is ancient (still mySQL?) and often broken; doesn't look like Google emphasizes it despite being their cash cow, more like deep state of neglect.
> still mySQL?

The F1 distributed database was developed to move the AdWords business off of MySQL.

https://research.google.com/pubs/pub41344.html

Adsense isn’t Google’s cash cow, Adwords is.
Ah OK, I might have been confused then. Thanks!
Google probably got back a lot of the engineering costs before it even rented out the first TPU, simply by virtue of running its own workloads, without having to buy tons of CPUs or GPUs. They're also very, very good at reducing computing resource waste (I know this firsthand).

I wouldn't be surprised if public TPUs are to some degree a way to print money: at least for a while, Google can probably just rent out its unused capacity that it had already planned and paid for. :-)

> I am surprised the TPU doesn't kick the ass of the NVIDIA chips.

30% cheaper e2e price for the company's first public offering, compared to the market leader's top-of-the-line chip sounds...pretty good to me?

30% list price. Who knows what the underlying margins are and how much cheaper Google can go with an offline agreement.
Since TPUs are used at Google to process data for its own service offerings (e.g. image classification, voice recognition, language translation, NLP, route planning, etc.), wouldn't it be fair to say that they will also be able to recoup the sunk costs (R&D) by purchasing fewer GPUs?
> TPU doesn't kick the ass of the NVIDIA chips

It used to, until Volta came out with what are basically TPUs embedded on the board. We will see whether AMD joins them, as Vega should in theory be around Volta's level as well; the tooling just isn't there yet.

How long has Google had the TPUv2 for internal use? I was under the impression that V100 and TPUv2 were developed around the same time. They were certainly announced around the same time, at least. It just seems weird to say "it used to" when V100 has been shipping since mid-summer 2017.
I think at least for inference TPUv1 was beating all previously available GPUs by a wide margin. TPUv2 did that for training as well, with the exception of Volta.
>I am surprised the TPU doesn't kick the ass of the NVIDIA chips.

Yeah, I'm a bit disappointed myself. When announced initially, it seemed Google had a huge lead. But they dragged their feet for two years getting it to market, and now NVidia is nipping at their heels already.

I suspect they are using the TPUs internally for competitive advantage, and these are the leftovers they are done with. They're probably using v4 or v5 internally already.

I agree with you that the cost of TPU development probably outweighs the number of dollars that Google will earn renting TPUs. The thing is, no one else has a TPU but Google. That doesn't look like it will change any time soon. That means that if you want to run the fastest machine learning models, you have to use Google Cloud. So Google doesn't just benefit from the TPUs directly; they can now attract more customers to their cloud. Once that starts happening, all of the best machine learning people will have Google Cloud experience. Then when they start something new, they will use what they know: Google Cloud. They will also create tooling that only works with TPUs and gives an advantage you cannot get outside of Google Cloud. So it will be a net win for Google even if it is more expensive to run a TPU than what they are renting them for.

tl;dr TPU helps Google Cloud's network effect.

Furthermore, computer hardware is not static. Is this a real long term investment by Google?

If they do not continue to improve on process, they will fall behind in just a few years.