Hacker News
by ribit 1491 days ago
And yet somehow Apple's GPU ALUs are more efficient, at 3.8 watts per TFLOP. Mind, I am not talking about specialized matrix multiplication units, which have a different internal organization and can do matrix multiplication much more efficiently, but about basic general-purpose GPU ALUs.
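For anyone who wants to reproduce a watts-per-TFLOP figure, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the clock and power inputs are placeholder assumptions, not measurements; only the 1024-ALU count is taken from this thread. It assumes each ALU retires one FMA per cycle and that an FMA counts as 2 FLOPs.

```python
def peak_tflops(alus: int, clock_ghz: float) -> float:
    """Peak FP32 throughput, assuming one FMA (2 FLOPs) per ALU per cycle."""
    return alus * 2 * clock_ghz / 1000.0


def watts_per_tflop(power_watts: float, tflops: float) -> float:
    """Efficiency metric used in the comment: lower is better."""
    return power_watts / tflops


# Placeholder inputs: 1024 ALUs comes from the thread; the 1.0 GHz clock
# and 8 W power draw are assumed values chosen only to show the formula.
tflops = peak_tflops(alus=1024, clock_ghz=1.0)
print(f"{tflops:.3f} TFLOPS, {watts_per_tflop(8.0, tflops):.2f} W/TFLOP")
```

Plugging in a measured GPU power draw and the real clock would give the actual figure for a particular chip.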

The comparison of efficiency between Apple and Nvidia here is a bit misleading because it compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.

As to how Apple achieves such high efficiency, nobody knows. The fact that they are on a 5nm node might help, but there must be something special about the ALU design as well. My speculation is that the ALUs are wider and much simpler than in other GPUs, which directly translates to efficiency wins.

1 comments

How they do it at a conceptual level isn't a big secret: they don't need to minimize die area the way other companies do. For Apple, the die is just one part of a chip that is part of the larger system they sell, so they can amortize its cost over the whole system. Nvidia doesn't have a system to do that with, so their natural inclination is to keep the die as small as possible and just overclock the hell out of it. That is the 'trick': Apple can afford to do things that chew up die space that Nvidia and others can't while maintaining their profit margins. Being a process generation ahead is also a rather huge advantage, and its cost is another thing Apple can amortize over large numbers of complete systems and mobile devices in a way their competitors can't.

Also related: Apple designs their hardware to do just what they want it to do, while everyone else is designing for a more general use case. That generality also costs die area, IP licensing fees, etc.

But how does that apply to GPU ALUs? Looking at M1 die shots, they are comparatively tiny, and compared to other vendors it doesn't seem like Apple is dedicating more logic space to the GPU. The M1 die is roughly 120 mm², and an Nvidia Turing TU117 (GTX 1650) is roughly 200 mm². Both feature the same number of GPU ALUs (1024 32-bit units). Of course, M1's 5nm is around 5-6 times denser than Turing's 12nm, but M1 is an entire SoC with all kinds of components, not to mention a huge cache. The GPU takes maybe 20% of the die (let's say 1/3 if you also count the display controller and memory controllers). All in all, the amount of normalised die space dedicated to GPU ALUs seems comparable.
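The back-of-the-envelope normalisation above can be written out as a rough sketch. It assumes area scales linearly with the quoted density ratio and treats the whole TU117 die as GPU, both of which are simplifications; the helper name is made up for illustration and all inputs are the approximate figures from the comment.

```python
def normalized_area_mm2(die_mm2: float, gpu_fraction: float,
                        density_ratio: float) -> float:
    """GPU area scaled to what it would occupy on the less dense node."""
    return die_mm2 * gpu_fraction * density_ratio


M1_DIE_MM2 = 120.0     # rough figure from the comment
TU117_DIE_MM2 = 200.0  # rough figure from the comment, treated as all GPU

# Low estimate: GPU is 20% of the die, 5nm is 5x denser than 12nm.
low = normalized_area_mm2(M1_DIE_MM2, 0.20, 5.0)
# High estimate: GPU plus controllers is 1/3 of the die, 6x denser.
high = normalized_area_mm2(M1_DIE_MM2, 1 / 3, 6.0)

print(f"M1 GPU, 12nm-equivalent: {low:.0f} to {high:.0f} mm^2 "
      f"vs TU117 at {TU117_DIE_MM2:.0f} mm^2")
```

The TU117 figure falls inside the estimated range, which is the sense in which the two look comparable.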

Of course, my perspective here might be extremely naive. I know very little about semiconductor technology and am just trying to understand the principal design differences.

Also, I thought Apple is adding a large slice of cache. Look at the 3D V-Cache on Ryzen: it has a large impact on performance (+15%?). And because Apple sells expensive stuff, they can afford to build expensive CPUs.

Cache doesn't matter for pure ALU efficiency though. I mean, I ran tests on long dependent chains of FMAs, where the only storage touched is two internal registers.
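To show what a dependent FMA chain looks like, here is a minimal CPU-side sketch in Python. It is not the actual GPU test described above, and the function names are made up for illustration. Each step consumes the previous result, so the loop never touches memory beyond a couple of live values, and throughput is counted at 2 FLOPs per FMA.

```python
import time


def dependent_fma_chain(x: float, a: float, b: float, steps: int) -> float:
    """Run x = x * a + b repeatedly; every step depends on the previous x."""
    for _ in range(steps):
        # Plain multiply-then-add here, not a hardware fused FMA.
        x = x * a + b
    return x


def achieved_tflops(fma_count: int, seconds: float) -> float:
    """Throughput in TFLOPS, counting each FMA as 2 floating-point ops."""
    return 2 * fma_count / seconds / 1e12


steps = 1_000_000
start = time.perf_counter()
result = dependent_fma_chain(1.0, 0.5, 1.0, steps)
elapsed = time.perf_counter() - start
print(f"result={result}, {achieved_tflops(steps, elapsed):.6f} TFLOPS")
```

On a GPU the same chain would run in every thread of a compute kernel, with enough threads in flight to hide ALU latency; the interpreter overhead here makes the printed number meaningless as a hardware measurement.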