by ATsch 1954 days ago
"CUDA Core" is a misleading marketing term. It merely counts the number of FP32 ALUs. If you measured CPUs this way, you'd have up to 16 "cores" per core. Counted that way, this CPU is roughly as powerful as the GTX 400 series with its "512 cores". A brief look at benchmarks seems to confirm that.

Conversely, if you counted a GPU's independent parallel threads of execution like on CPUs, you'd max out at a very comparable 64 "cores".
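As a rough sketch of that counting: the figure of 16 falls out if you assume two 256-bit SIMD units per CPU core, each holding eight 32-bit lanes. That layout is an assumption for illustration, and the function name and parameters are hypothetical.

```python
def fp32_lanes_per_core(simd_units: int, vector_bits: int) -> int:
    """Count FP32 lanes the way "CUDA cores" are counted: one per 32-bit ALU lane."""
    return simd_units * (vector_bits // 32)


# Assumed layout (not a vendor spec): 2 SIMD units of 256 bits each per core.
lanes = fp32_lanes_per_core(simd_units=2, vector_bits=256)
print(f'{lanes} "cores" per core by this counting')
```

Changing either parameter shows how sensitive the "core" count is to what you decide to count.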

SIMT is a much more convenient programming model than SIMD for wide compute, though. So much so that I think the marketing around CUDA cores isn't merely reasonable but for many applications actually strikes closer to the truth than counting ALUs or independent threads.

If you count ALUs, you see that the CPU has many of them, but you don't see how difficult it is to chunk up data to keep those fed.

If you count independent threads, you see that the GPU has few of them, but you don't see how it conceptually has many threads, which simplifies the programming of each thread while gracefully degrading only in proportion to how much branching you actually use.
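A toy model of that "degrading in proportion to branching" idea, as a sketch only: it assumes a SIMT group runs each distinct branch path once with the other lanes masked off, and that all paths take equally long. Real hardware is more nuanced, and the group width of 32 and all names here are assumptions for illustration.

```python
from typing import Sequence


def simt_passes(lane_paths: Sequence[str]) -> int:
    """Number of passes the group needs: one per distinct path taken by any lane."""
    return len(set(lane_paths))


def utilization(lane_paths: Sequence[str]) -> float:
    """Fraction of lane-slots doing useful work, assuming equal-length paths.

    Each lane is active in exactly one pass, so useful work is 1/passes.
    """
    return 1 / simt_passes(lane_paths)


WIDTH = 32  # assumed group width, for illustration only

uniform = ["a"] * WIDTH                             # no divergence
two_way = ["a"] * 16 + ["b"] * 16                   # one if/else split
four_way = [("a", "b", "c", "d")[i % 4] for i in range(WIDTH)]

for name, paths in [("uniform", uniform), ("two-way", two_way), ("four-way", four_way)]:
    print(f"{name}: {simt_passes(paths)} pass(es), utilization {utilization(paths):.2f}")
```

The per-thread code stays the same in every case; only the throughput drops as lanes disagree about which path to take.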

I definitely agree with it being a useful model; after all, that is the way GPUs are presented to programmers. But when getting into detailed performance comparisons, especially with CPUs, that simplified model breaks down quickly and becomes a hindrance. Which is why I think it's unfortunate that NVidia (deliberately) uses the term "core", which leads people to believe they can make a direct comparison to CPUs. Comparisons that coincidentally lead people to believe GPUs are much more special than they actually are.
In detailed comparisons all simplified models break down. Tallying max theoretical compute is only appropriate if you're going to put in the effort to actually use it, which is exceedingly rare, even in compute kernels that have supposedly had lots of love and attention already paid to them. So the human factor has to be included in the model and the human factor consistently de-rates SIMD more than it does SIMT.

I realize that under the covers this is more of a compiler/language thing than a compute model thing, but for whatever reason I just don't see much SIMT code targeting CPUs, so again, human factor.

My primary objection is not to the SIMT model, but to NVidia reusing a term with an established meaning ("core") for something completely different and incomparable. Other companies' terms like "execution units", "compute units" and "stream processors" (although perhaps not as much the last) are much more truthful about the nature of GPUs without hindering the programming model at all.
From the standpoint of a programmer who doesn't want to suffer through constant DIY chunking and packing (very close to my personal definition of hell), a CUDA core looks a lot like a CPU core. From the point of view of someone not writing the code, a CUDA core looks like merely another FPU.
This is, itself, a misleading comment!

Yes, "CUDA Core" is (was?) analogous to "FP32 ALU" - (it may have transitioned to FP16 nowadays to double the count?)

Yes, EPYC has 16 FP32 ALUs per core.

But single GPU "Compute Units" often have 2-4x that (up to 64 FP32 ALUs), and a lot of other differences which, when added up at the whole-GPU level, mean they don't max out anywhere near '64 "cores"' "if you counted a GPUs independent parallel threads of execution like on CPUs".

That's where they couldn't be more different!

Also, for accuracy, vis-a-vis "independent parallel threads of execution like on CPUs", even the 64-core Threadripper has 128, not 64.

And many GPUs with 64 physical "Compute Units" would have at least 640 by the same measure (since many have 10 independent programs which can be resident in hardware at once, compared to SMT2 on Threadripper).

But when you break it down further, while a model GPU Compute Unit from a few years ago might have 64x FP32 ALUs in hardware, that CU has the capacity to schedule and execute single instructions on 10*64 logical threads (= 640 logical threaded instructions in-flight on a single CU at once). On a 64CU chip, that's 40,960 logical floating point instructions in flight at once, each one of these coming from a different logical thread of execution in a SIMT model.

The superscalar CPU cores can also have lots of instructions in flight, but they are more deeply pipelining instructions for fewer threads of execution (focused on single-threaded performance).

This is all a long way of saying that a.) you may not have been so far off when comparing raw ALU counts; but, b.) you couldn't be misrepresenting the facts more when comparing the differences in "parallel threads of execution."

This is a core architectural difference between CPUs and GPUs, and while yes, there are similarities in ALU count, the way the transistors and ALUs are utilized by software is quite different.

To oversimplify, and ignoring performance of applying these different programming models to different compute architectures:

Using an example 64-CU GPU compared to 64-Core EPYC:

64 CU GPU as SIMD: 640 Threads (SMT10) and 4096 FP32 ALUs (SIMD64 - actually 4xSIMD16)

64 Core CPU as SIMD: 128 Threads (SMT2) and 1024 FP32 ALUs (SIMD16 - actually 2xSIMD8)

64 CU GPU as SIMT: 40,960 Threads (10:1 Threads:ALUs)

64 Core CPU as SIMT: 2,048 Threads (2:1 Threads:ALUs)

So, this model 64-CU GPU can schedule 20x more logical hardware threads in a SIMT programming model than the 64-Core EPYC CPU, despite having only 4x the FP32 ALUs.
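The arithmetic behind those figures can be checked with a short sketch. The `Chip` class and its field names are hypothetical; the inputs are just the model numbers from this comment (64 units, SMT10 with 64 lanes for the GPU, SMT2 with 16 lanes for the CPU), not measurements of any specific product.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    units: int       # compute units (GPU) or cores (CPU)
    smt_ways: int    # programs resident in hardware per unit
    fp32_lanes: int  # FP32 ALUs per unit

    @property
    def simd_threads(self) -> int:
        # Threads as a CPU programmer counts them: one per resident program.
        return self.units * self.smt_ways

    @property
    def fp32_alus(self) -> int:
        return self.units * self.fp32_lanes

    @property
    def simt_threads(self) -> int:
        # SIMT view: every resident program contributes one logical thread per lane.
        return self.simd_threads * self.fp32_lanes


gpu = Chip("64-CU GPU", units=64, smt_ways=10, fp32_lanes=64)
cpu = Chip("64-core CPU", units=64, smt_ways=2, fp32_lanes=16)

for chip in (gpu, cpu):
    print(f"{chip.name}: {chip.simd_threads} SIMD threads, "
          f"{chip.fp32_alus} FP32 ALUs, {chip.simt_threads} SIMT threads")

print("SIMT thread ratio:", gpu.simt_threads // cpu.simt_threads)
print("FP32 ALU ratio:", gpu.fp32_alus // cpu.fp32_alus)
```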

My final caveat would be that GPUs are over time increasing single-threaded performance, and CPUs are becoming wider. So, in a way, there is some architectural convergence - but it's not to be overstated.

One last note: Niagara was a bit GPU-like, with round-robin thread scheduling, and POWER now has SMT8. But differences remain.

Hey, thanks for the detailed comment. I should note I didn't intend to create the impression that GPU and CPU architectures were completely interchangeable. That's obviously not true, because of the fixed-function hardware alone. The intent was more to give a rough idea of what "CUDA Core" actually means and roughly how those concepts map to what we know in CPUs.

> Yes, "CUDA Core" is (was?) analogous to "FP32 ALU" - (it may have transitioned to FP16 nowadays to double the count?)

It's still FP32 ALUs (NVidia disables FP16 in the driver for gaming cards). The doubling between Turing and Ampere is due to the combined int/fp units being counted as CUDA cores. This also means that for int instructions, the expected performance is not actually increased, another reason the "core" term is unhelpful.

> But, single GPU "Compute Units" often have 2-4x that (up-to 64 FP32 ALU)

Yep, guess I should have left that AVX2048 joke in there for you. It just wasn't very important IMO.

> Also, for accuracy, vis-a-vis "independent parallel threads of execution like on CPUs", even the 64-core Threadripper has 128, not 64.

I did not consider SMT here because, while relevant for keeping the ALUs fed with instructions, it doesn't increase the amount of raw peak compute possible. Those 128 threads can still only issue 64 AVX instructions at any one time, which is what really matters here.