by boulos
3036 days ago
Disclosure: I work on Google Cloud.

By the logic above, you would conclude that TPUv1 (the inference-only chip) might have been a mistake, but we’ve been very public about how it “saved us from building lots of datacenters”. TPUv1 was never sold as part of Cloud, so the benefit there comes entirely from the second point you mentioned: it was cheaper and more efficient than the GPUs of the time. The paper goes into more detail, but the initial engineering team was small and the time to market was short.

For training, before Volta (and kind of Pascal), GPUs were the best option but not particularly efficient. Volta takes the same “we should have a single instruction that does lots of math in one shot” approach by cleverly reusing the existing functional units. That the V100 is a great chip is a good outcome for the whole industry.

But GPUs aren’t (and shouldn’t be) focused just on ML. My bet is that there’s still a decent amount of runway left in specialized chips for ML, just as GPUs carved out their own niche versus CPUs. But again, the “even just for Google” benefit is really enormous, so I wouldn’t assume that Cloud has to pay for the entire effort.

Could GPU manufacturers improve the cost:performance ratio of ML workloads enough that Google doesn’t have to build TPUs anymore? Perhaps, but like the V100 improvements, that would be a great outcome!