by p1esk 2605 days ago
Nvidia could have released a DL specific chip a long time ago, if they wanted to. I’m not sure why they haven’t (market not big enough?), but they probably will at some point.
3 comments

They release a "data center" specific version of their gpu with slightly improved stats compared to the consumer models, and 10x the price... (And include a "no data centers" clause in the consumer model terms of use.)
(I work at Google on compilers for ml, including compilers for Nvidia gpus.)

Devices like the v100 and t4 are ml-specific chips. You can do graphics stuff on them, but that doesn't mean that Nvidia is leaving a ton of ml performance on the table by including that capability. Indeed, there may be economies of scale for them in having fewer architectures to support.

They aren't dumb. :)

V100 has 640 tensor cores, and 5k general FP32/64 cores. Most of DL computation is done by tensor cores. Can you imagine how much faster it would get if they released a chip with say 10k tensor cores?
> Can you imagine how much faster it would get if they released a chip with say 10k tensor cores?

I can, actually. :) Adding 10k tensor cores to a GPU would not make it run much faster, and would be prohibitive in terms of die space. Moreover, getting rid of the 5,000 FP32 cores would slow down DL workloads significantly.

The 640 Tensor Cores vs 5,000 F32 cores comparison is misleading, because they are not measuring the same thing.

An "FP32/64 core" corresponds to a functional unit on the GPU streaming multiprocessor (SM) which is capable of doing one scalar FP32 operation. One FLOP, or maybe two if you are doing an FMA. V100 has 5120 FP32 units and 2560 FP64 units.

In contrast, a "Tensor Core" corresponds to a functional unit on the SM which is capable of doing 64 FMAs per clock. That is, a Tensor Core does 64 times as much work per clock as an FP32 core. Integrated circuits aren't magic: if you're doing 64 times as much work, you need more die space.
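To make the unit mismatch concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in this thread (5120 FP32 units, 640 Tensor Cores, 64 FMAs per clock per Tensor Core); the function and variable names are made up for illustration, and clock speeds are ignored.

```python
# Figures as quoted in the thread; names here are illustrative only.
FP32_UNITS = 5120            # each does 1 FMA per clock
TENSOR_CORES = 640           # each does 64 FMAs per clock
FMAS_PER_FP32_UNIT = 1
FMAS_PER_TENSOR_CORE = 64


def fmas_per_clock(units, fmas_per_unit):
    """Total FMAs the whole chip can issue per clock for one unit type."""
    return units * fmas_per_unit


fp32_total = fmas_per_clock(FP32_UNITS, FMAS_PER_FP32_UNIT)
tensor_total = fmas_per_clock(TENSOR_CORES, FMAS_PER_TENSOR_CORE)

# Hypothetical chip from the parent comment: 10k tensor cores.
hypothetical_total = fmas_per_clock(10_000, FMAS_PER_TENSOR_CORE)

print("FP32 FMAs/clock:  ", fp32_total)
print("Tensor FMAs/clock:", tensor_total)
print("Tensor vs FP32:   ", tensor_total / fp32_total)
print("10k tensor cores vs today's 640:", hypothetical_total / tensor_total)
```

Counted this way, the 640 Tensor Cores already provide about 8 times the per-clock FMA throughput of the 5120 FP32 units, so comparing raw "core" counts understates how much of the chip's arithmetic is already devoted to them.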

Moreover, there is nothing to say that nvidia isn't able to use some of the same circuits for both the fp32 and tensor core operations. If they are able to do this (I expect they are) then reducing one does not necessarily make space for the other.

Increasing the number of tensor core flops by a factor of 20 (roughly 500 -> 10,000) would not make the GPU 20x faster, probably not even 2x faster. This is because you will quickly run into GPU memory bandwidth limitations. This is not a simple problem to solve; nvidia GPUs are already pushing what is possible with HBM.
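One common way to reason about this is a roofline-style bound: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is only that, a sketch. The peak and bandwidth numbers are assumptions based on roughly V100-class published specs (they are not from this thread), and the two arithmetic intensities are hypothetical.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: compute-limited or memory-limited, whichever is lower.

    intensity is FLOPs performed per byte moved to or from device memory.
    """
    return min(peak_flops, mem_bandwidth * intensity)


# Assumptions, not from the thread: roughly V100-class figures.
PEAK_TENSOR_FLOPS = 125e12   # FLOP/s
MEM_BANDWIDTH = 900e9        # bytes/s


def speedup_from_more_compute(intensity, compute_scale=20):
    """Speedup from scaling peak compute while memory bandwidth stays fixed."""
    before = attainable_flops(PEAK_TENSOR_FLOPS, MEM_BANDWIDTH, intensity)
    after = attainable_flops(PEAK_TENSOR_FLOPS * compute_scale,
                             MEM_BANDWIDTH, intensity)
    return after / before


for intensity in (50, 200):  # hypothetical FLOPs per byte
    print(intensity, speedup_from_more_compute(intensity))
```

Under these assumed numbers, a kernel at 50 FLOPs per byte is already bandwidth-bound and gains nothing from 20x more compute, and one at 200 FLOPs per byte gains well under 2x before it hits the same wall.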

Lastly, although you're correct that, in terms of number of flops, most DL computation is done by tensor cores (if you've written your application in fp16), that doesn't mean we could get rid of the f32 compute units, or even that significantly reducing their number would have minimal effect on our models. Recall Amdahl's law. We usually think about it in terms of speedups, but it applies equally well in terms of slowdowns. If even 10% of our time is spent doing f32 compute, and we make it 10x slower...well, you can do the math. https://en.wikipedia.org/wiki/Amdahl%27s_law
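For anyone who wants the math spelled out, here is a minimal sketch of Amdahl's law applied as a slowdown, using the 10% and 10x figures from the paragraph above. The function name is made up for illustration.

```python
def relative_runtime(affected_fraction, slowdown):
    """New runtime relative to the old one (old runtime = 1.0).

    affected_fraction: share of the original runtime spent in the part
                       that gets slower (e.g. fp32 compute).
    slowdown:          how many times slower that part becomes.
    """
    unaffected = 1.0 - affected_fraction
    return unaffected + affected_fraction * slowdown


# 10% of time in fp32 compute, made 10x slower.
total = relative_runtime(0.10, 10)
print(f"whole model runs {total:.2f}x as long")
```

With those inputs the whole workload takes about 1.9 times as long, so the fp32 part would go from 10% of the runtime to more than half of it.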

Indeed, I was just looking at an fp16 tensor-core cudnn kernel yesterday, and even it did a significant amount of fp32 compute.

The implicit argument I read in parent post is that nvidia could build a significantly better DL chip "simply" by changing the quantities of different functional units on the GPU. This is predicated on nvidia being quite bad at their core competency of designing hardware, despite their being the market leader in DL hardware. It's kind of staggering to me how quickly nonexperts jump to this conclusion.

Here's a talk I gave at cppcon about much of this (note that it's pre Volta). https://www.youtube.com/watch?v=KHa-OSrZPGo&t=1s

Thank you for the detailed answer.

I think your main point is that memory bandwidth would prevent the performance speedup. Are V100s memory bound when executing F16 ops on tensor cores?

Second, do we really need dedicated FP32 cores for DL? Tensor cores accumulate in FP32 (is that what you meant when you said they did a significant amount of FP32 compute?), and recent papers indicate we’re moving towards 8 bit training [1]. Besides, do TPUs use dedicated FP32 hw?

Finally, if memory bandwidth is indeed the bottleneck, perhaps all that die area from FP32 and especially FP64 cores could be used for a massive amount of cache.

[1] https://arxiv.org/abs/1805.11046

V100s are often memory bound when using tensor cores, yes. But I guess my point is broader than that. There is a "right shape" for hardware that wants to excel at a particular workload, depending on the arithmetic intensity, degree of temporal locality, and so on. The point is that you usually can't just turn up one dimension to eleven; it's rarely that simple.

For example, massively increasing the GPU last level cache size would not have the effect of increasing memory bw much on most workloads, because cache only helps when you have temporal locality and gpus like to stream through many GB of data.
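A toy illustration of the temporal-locality point, assuming a simple LRU cache model (real GPU caches are more complicated, and the trace sizes here are arbitrary): a streaming access pattern gets no hits no matter how big the cache is, while a pattern that reuses a small working set does.

```python
from collections import OrderedDict


def hit_rate(accesses, cache_lines):
    """Fraction of accesses that hit in an LRU cache holding cache_lines lines."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)


# Streaming: every line is touched exactly once (no temporal locality).
streaming = list(range(100_000))
# Reuse: a working set of 1,000 lines is swept over and over.
reuse = [i % 1_000 for i in range(100_000)]

for size in (1_024, 16_384):
    print(size, hit_rate(streaming, size), hit_rate(reuse, size))
```

In this model, growing the cache 16x changes nothing for the streaming trace, because there is nothing to reuse; only the trace with a small, repeatedly touched working set benefits from having a cache at all.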

This is covered in Hennessy and Patterson if you're curious to learn more. I also talk about it some in the video I linked above.

(Also I doubt that getting rid of f64 support would be a significant die size win. I notice that v100 has, in their marketing speak, twice the fp32 cores as fp64 cores. What do you think are the chances that Nvidia decided a priori this is the optimal ratio? What if instead they are sharing resources between these functional units, at a ratio of two to one?)

To the question of whether you really need fp32 cores: I am not aware of any "widely deployed" GPU model today that does not do significant fp32 work. Perhaps there is research which suggests this isn't necessary! But that is a different question from the one we were discussing here, which is whether Nvidia could somehow make a much better chip for the things people are doing today.

I don't want to speak to the question of whether TPUs have f32 hardware, because I'm afraid of saying something that might not be public. But I think the answer to your question can easily be found by some searching and is probably even in the public docs.

I’d really like to know this as well. NVidia has become a very AI-centric company, but this has been a huge blind spot. GPUs with power and chip real estate wasted on rendering hardware are unnecessary relics of legacy for deep learning. Why haven’t they designed a CUDA++-only chip yet?
Probably because most of their customers don’t work with Google-like problems, and will not buy chips that are 10x faster on paper but 10 times slower on the problems they DO have, at 10x the price...