by layla5alive 1954 days ago

This is, itself, a misleading comment!

Yes, "CUDA Core" is (was?) analogous to "FP32 ALU" - (it may have transitioned to FP16 nowadays to double the count?)

Yes, EPYC has 16 FP32 ALU per Core.

But single GPU "Compute Units" often have 2-4x that (up to 64 FP32 ALUs), along with a lot of other differences. Added up at the whole-GPU level, those differences mean GPUs don't max out anywhere near '64 "cores"' "if you counted a GPUs independent parallel threads of execution like on CPUs".

That's where they couldn't be more different!

Also, for accuracy, vis-a-vis "independent parallel threads of execution like on CPUs", even the 64-core Threadripper has 128, not 64.

And many GPUs with 64 physical "Compute Units" would have at least 640 by the same measure (since many have 10 independent programs which can be resident in hardware at once, compared to SMT2 on Threadripper).

But when you break it down further, while a model GPU Compute Unit from a few years ago might have 64x FP32 ALUs in hardware, that CU has the capacity to schedule and execute single instructions on 10*64 logical threads (= 640 logical threaded instructions in-flight on a single CU at once). On a 64CU chip, that's 40,960 logical floating point instructions in flight at once, each one of these coming from a different logical thread of execution in a SIMT model.

The superscalar CPU cores can also have lots of instructions in flight, but they are more deeply pipelining instructions for fewer threads of execution (focused on single-threaded performance).

This is all a long way of saying that a.) you may not have been so far off when comparing raw ALU counts; but, b.) you couldn't be misrepresenting the facts more when comparing the differences in "parallel threads of execution."

This is a core architectural difference between CPUs and GPUs, and while yes, there are similarities in ALU count, the way the transistors and ALUs are utilized by software is quite different.

To oversimplify, and ignoring performance of applying these different programming models to different compute architectures:

Using an example 64-CU GPU compared to 64-Core EPYC:

64 CU GPU as SIMD: 640 Threads (SMT10) and 4096 FP32 ALUs (SIMD64 - actually 4xSIMD16)

64 Core CPU as SIMD: 128 Threads (SMT2) and 1024 FP32 ALUs (SIMD16 - actually 2xSIMD8)

64 CU GPU as SIMT: 40,960 Threads (10:1 Threads:ALUs)

64 Core CPU as SIMT: 2,048 Threads (2:1 Threads:ALUs)

So, this model 64-CU GPU can schedule 20x more logical hardware threads in a SIMT programming model than the 64-Core EPYC CPU, despite having only 4x the FP32 ALUs.
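For anyone who wants to check the arithmetic, here is a minimal Python sketch of the comparison above. The `Chip` class and its field names are made up for illustration; the numbers are the ones from the comment (64 units each, 64 vs. 16 FP32 ALUs per unit, 10 vs. 2 resident programs per unit), not measurements of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical toy model of the example chips in the comment."""
    name: str
    units: int              # compute units (GPU) or cores (CPU)
    alus_per_unit: int      # FP32 ALUs per unit
    resident_per_unit: int  # programs resident in hardware per unit (SMT ways)

    @property
    def fp32_alus(self) -> int:
        return self.units * self.alus_per_unit

    @property
    def simd_threads(self) -> int:
        # SIMD view: one thread per resident program
        return self.units * self.resident_per_unit

    @property
    def simt_threads(self) -> int:
        # SIMT view: every ALU lane of every resident program is a logical thread
        return self.fp32_alus * self.resident_per_unit


gpu = Chip("64-CU GPU", units=64, alus_per_unit=64, resident_per_unit=10)
cpu = Chip("64-core EPYC", units=64, alus_per_unit=16, resident_per_unit=2)

for chip in (gpu, cpu):
    print(chip.name, chip.simd_threads, chip.fp32_alus, chip.simt_threads)

print("SIMT thread ratio:", gpu.simt_threads // cpu.simt_threads)
print("FP32 ALU ratio:", gpu.fp32_alus // cpu.fp32_alus)
```

The SIMT counts are just ALUs multiplied by resident programs, which is why the thread ratio (20x) ends up larger than the ALU ratio (4x).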

My final caveat would be that GPUs are over time increasing single-threaded performance, and CPUs are becoming wider. So, in a way, there is some architectural convergence - but it's not to be overstated.

One last note: Niagara was a bit GPU-like, with round-robin thread scheduling, and POWER now has SMT8. But differences remain.

1 comment

ATsch 1954 days ago

Hey, thanks for the detailed comment. I should note I didn't intend to create the impression that GPU and CPU architectures were completely interchangeable. That's obviously not true, because of the fixed function hardware alone. The intent was more to give a rough idea of what "CUDA Core" actually means and roughly how those concepts map to what we know in CPUs.

> Yes, "CUDA Core" is (was?) analogous to "FP32 ALU" - (it may have transitioned to FP16 nowadays to double the count?)

It's still FP32 ALUs (NVidia disables FP16 in the driver for gaming cards). The doubling between Turing and Ampere is due to the combined int/fp units being counted as CUDA cores. This also means that for int instructions, the expected performance is not actually increased, another reason the "core" term is unhelpful.

> But, single GPU "Compute Units" often have 2-4x that (up-to 64 FP32 ALU)

Yep, guess I should have left that AVX2048 joke in there for you. It just wasn't very important IMO.

> Also, for accuracy, vis-a-vis "independent parallel threads of execution like on CPUs", even the 64-core Threadripper has 128, not 64.

I did not consider SMT here because, while relevant for keeping the ALUs fed with instructions, it doesn't increase the amount of raw peak compute possible. Those 128 threads can still only issue 64 AVX instructions at any one time, which is what really matters here.
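A rough sketch of that point in Python, under the simplifying assumption stated in the reply (one AVX instruction issued per core at any one time). The function names and the `issue_per_core` parameter are hypothetical, and real cores may issue more than one vector instruction per cycle.

```python
def hardware_threads(cores: int, smt_ways: int) -> int:
    """Threads visible to software: scales with SMT."""
    return cores * smt_ways


def peak_avx_issue(cores: int, issue_per_core: int = 1) -> int:
    """AVX instructions issued at once: depends on cores, not SMT ways.

    issue_per_core=1 is an assumption taken from the reply above.
    """
    return cores * issue_per_core


with_smt = hardware_threads(64, smt_ways=2)
without_smt = hardware_threads(64, smt_ways=1)
print(with_smt, without_smt, peak_avx_issue(64))
```

In this model, turning SMT on or off changes the thread count but leaves the peak issue rate unchanged.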