by zmmmmm 959 days ago
Does Apple document exactly how many actual cores there are inside their GPUs? It is always confusing: they say "40 core GPU", but I assume these are shader cores, each of which can execute (per the video) "many thousands" of parallel execution paths.

So how does one translate to an equivalent in "CUDA cores" type terminology?

What Apple calls a GPU core seems to be roughly the same as what Nvidia calls a “streaming multiprocessor”.

For example, a GTX 1080 GPU has 20 streaming multiprocessors (SMs), each containing 128 cores, each of which supports 16 threads.

Meanwhile Apple describes the M1 GPU as having 8 cores, where “each core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs). In total, the M1 GPU contains up to 128 Execution units or 1024 ALUs, which Apple says can execute up to 24,576 threads simultaneously and which have a maximum floating point (FP32) performance of 2.6 TFLOPs.”

So one option for getting a single number for a rough comparison is to count threads. The GTX 1080 supports 40,960 threads, while the M1 supports 24,576 threads.
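
The thread counts above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this thread; the function name `resident_threads` is made up for illustration.

```python
# Thread-count arithmetic using only figures quoted in this thread.

def resident_threads(units: int, cores_per_unit: int, threads_per_core: int) -> int:
    """Total number of threads the GPU can hold at once."""
    return units * cores_per_unit * threads_per_core

# GTX 1080: 20 SMs x 128 cores x 16 threads per core.
gtx1080_threads = resident_threads(20, 128, 16)

# M1: the quoted description gives the total directly.
m1_threads = 24_576

print(gtx1080_threads)                         # 40960
print(round(gtx1080_threads / m1_threads, 2))  # 1.67
```

By this measure the GTX 1080 holds roughly 1.67x as many threads as the M1, which says nothing yet about how fast each thread runs.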

There’s obviously a lot more to a GPU: for starters, clock speeds vary, ALUs can have different capabilities, memory bandwidth differs, and so on. But at least counting threads gives a better idea of the processing bandwidth than talking about cores.

Just for clarification: The 1080 has 20 SMs with 128 FPUs each. Each FPU can perform 2 FLOPs per cycle (fused multiply adds). Combined with the frequency of 1607 MHz we land on the advertised 8.2 TFlop/s.
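
As a rough check of that arithmetic, here is a minimal sketch. The helper name `peak_fp32_tflops` is hypothetical, and the M1 clock at the end is only inferred from the 2.6 TFLOPs and 1024 ALUs quoted earlier, assuming one fused multiply-add per ALU per cycle; it is not a figure from this thread.

```python
def peak_fp32_tflops(sms: int, fpus_per_sm: int, clock_mhz: float,
                     flops_per_cycle: int = 2) -> float:
    """Theoretical peak: units x FPUs x FLOPs per cycle x clock."""
    return sms * fpus_per_sm * flops_per_cycle * clock_mhz * 1e6 / 1e12

# GTX 1080: 20 SMs, 128 FPUs each, 2 FLOPs per cycle (FMA), 1607 MHz.
gtx1080 = peak_fp32_tflops(20, 128, 1607)
print(f"{gtx1080:.2f} TFLOP/s")  # 8.23 TFLOP/s

# Working backwards for the M1 (an inference, not a quoted spec):
# clock = peak / (ALUs x 2 FLOPs per cycle)
m1_implied_clock_mhz = 2.6e12 / (1024 * 2) / 1e6
print(f"{m1_implied_clock_mhz:.0f} MHz")  # about 1270 MHz
```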

The fact that each SM can keep 2048 threads resident (the maximum CUDA block size on that card is 1024) doesn't do much for the theoretical flops. Only a fraction of those threads can be active at a time. The others are idling or waiting on their memory requests. This hides a lot of the memory latency.

For sure. Just counting threads doesn't give anything like a complete picture of performance.

It's still somewhat interesting because threads are a low-level programming primitive. If you can come up with work for 40k simultaneous threads, you can use the GPU effectively. For some tasks this parallelization is obvious (an HD video frame has 2 million pixels and shading them independently is trivial), and of course often it's anything but.

If the architecture is vastly different, these comparisons become sort of meaningless, though. The ultimate performance is determined by all the tiny little bottlenecks, like how quickly the architecture can move data between different types of cores, memory, cache, etc.

Apple has always been really good at parallelism, which is why they get so much performance from less power consumption.

Yeah, it’s completely meaningless to compare but HN loves specs & seemingly hasn’t learned—after about 15 years of it being true—that direct spec comparisons are meaningless.

I see it with every major product announcement; the worst are usually Apple threads, but it’s not confined to them.

Question from a GPU novice. I presume a thread is the individual calculation performed on one element of the vector? Can the 128 cores, or 20 SMs, perform different operations at the same time, or do all 24,576 threads perform the same operation at the same time on a vector of data of length 24,576?

"128 * the-number-of-cores" threads can make progress truly in parallel (at the same time).

24,576 threads (or however many; I didn't validate the number, and it depends on occupancy, which depends on per-thread resource usage such as registers, which in turn depends on the shader program code) is how many threads can be executed concurrently (as opposed to in parallel), that is, how many of them can simultaneously reside on the GPU. At any time a subset of those is actually executed in parallel; the rest are idle.
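
To put numbers on the parallel vs. concurrent distinction, here is a small sketch for the 8-core M1 discussed earlier. The 128 lanes per core and the 24,576 total come from this thread; the function names are made up for illustration.

```python
LANES_PER_CORE = 128  # the "128 * the-number-of-cores" figure above

def parallel_lanes(cores: int) -> int:
    """Threads that can make progress in the same cycle."""
    return LANES_PER_CORE * cores

def resident_per_lane(resident_threads: int, cores: int) -> float:
    """Threads resident on the GPU for each lane that can execute."""
    return resident_threads / parallel_lanes(cores)

m1_cores = 8
m1_resident = 24_576

print(parallel_lanes(m1_cores))                  # 1024
print(resident_per_lane(m1_resident, m1_cores))  # 24.0
```

So at full occupancy each lane has about 24 threads to pick from, of which one runs at any instant.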

You can think of this situation as follows using an analogy with a CPU and an OS:

1. 128 * the-number-of-cores is the number of CPU cores(*1)

2. 24,576 threads is the number of threads in the system that the OS is switching between

Major differences with the GPU:

3. On a CPU, a context switch (getting a thread off the core, waking up a different thread, restoring the context, and proceeding) takes about 2,000 cycles. On a GPU, the thread switching _from the analogy_ takes ~1-10 cycles, depending on the exact GPU design and various other details.

4. In the CPU/OS world, context switching and scheduling on the OS side are done mostly in software, as the OS is indeed software. In the GPU's case, the scheduler and all the switching are implemented as fixed-function hardware finely permeating the GPU design.

5. In the CPU/OS world, those 2,000 cycles per context switch are so much larger than a round trip to DRAM for a load instruction that happened to miss in all caches (about 400-800 cycles or so, depending on the design) that the OS never switches threads to hide load latencies; it would be pointless. As far as performance is concerned (as opposed to maintaining the illusion of parallel execution of all programs on the computer), thread switching is used to hide the latency of I/O: non-volatile storage access, network access, user input, etc., which takes millions of cycles or more, so there it makes sense.

In the GPU world the switching is so fast, that the hardware scheduler absolutely does switch from thread to thread to hide latencies of loads (even the ones hitting in half of the caches, if that happens), in fact, hiding these latencies and thus keeping ALUs fed is the whole point of this basic design of pretty much all programmable GPUs that there ever were.

6. In real world CPU/OS, the threads that aren't running at the time reside (their local variables, etc) in the memory hierarchy, technically, some of it ends up in caches, but ultimately, the bulk of it on a loaded system is in system DRAM. On a GPU, or I suppose by now we have to say, on a traditional GPU, these resident threads (their local variables, etc) reside in on-chip SRAM that is a part of the GPU cores (not even in a chunk on a side, but close to execution units in many small chunks, one per core). While the amount of DRAM (CPU/OS) is a) huge, gigabytes, and b) easily configurable, the amount of thread state the GPU scheduler is shuffling around is measured typically in hundreds of KBs per GPU Core (so on the order of about "a few MBs" per GPU), and the equally sized SRAM storing this state is completely hardwired in the silicon design of the GPU and not configurable at all.
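
The latency-hiding idea in points 3-6 can be illustrated with a deliberately crude toy model. Everything here is an assumption made for illustration: each thread alternates a fixed burst of ALU work with one load, switching is free, and the 20-cycle burst is an invented number. The 400-cycle latency is borrowed from the CPU DRAM figure above; real GPU latencies and schedulers differ.

```python
import math

def alu_utilization(resident_threads: int, compute_cycles: int,
                    latency_cycles: int) -> float:
    """Fraction of cycles one lane does useful work in the toy model.

    Each thread keeps the lane busy for compute_cycles out of every
    (compute_cycles + latency_cycles); other resident threads fill the gaps.
    """
    per_thread = compute_cycles / (compute_cycles + latency_cycles)
    return min(1.0, resident_threads * per_thread)

def threads_to_hide_latency(compute_cycles: int, latency_cycles: int) -> int:
    """Resident threads per lane needed to keep the lane always busy."""
    return math.ceil((compute_cycles + latency_cycles) / compute_cycles)

# Hypothetical workload: 20 cycles of math per 400-cycle load.
print(round(alu_utilization(1, 20, 400), 3))   # 0.048, like a stalled CPU core
print(round(alu_utilization(16, 20, 400), 3))  # 0.762
print(alu_utilization(24, 20, 400))            # 1.0
print(threads_to_hide_latency(20, 400))        # 21
```

In this model a single thread leaves the lane idle about 95% of the time, while a couple dozen resident threads keep it fully fed, which is the reason for holding so much thread state in on-chip SRAM.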

Hope that helps!

footnotes (*1) a better analogy would be not "number of CPU cores", but "number-of-CPU-cores * SMT(HT) * number-of-lanes-in-AVX-registers", where number-of-lanes-in-AVX-registers is basically "AVX-register-width / 32" for FP32 processing which (the latter) yields about ~8 give or take 2x depending on the processor model. Whether to include SMT(HT) multiplier (2) in this analogy is also murky, there is an argument to be made for yes, and an argument to be made for no, and depends on the exact GPU design in question.

So, in NVIDIA parlance, my Skylake laptop would have 128 "cuda cores"?

128 = 4 (physical cores) * 2 (hyperthreading) * 8 (AVX2 f32 lanes) * 2 (floating point ports per core)
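
The same product as a short sketch, with a switch for whether to count the SMT multiplier, which the footnote above calls murky. The function and parameter names are hypothetical.

```python
def cpu_fp32_lanes(physical_cores: int, smt_threads: int, simd_width_bits: int,
                   fp_ports: int, count_smt: bool = True) -> int:
    """CPU analogue of a "CUDA core" count for FP32 work."""
    lanes_per_register = simd_width_bits // 32  # FP32 lanes per SIMD register
    smt = smt_threads if count_smt else 1
    return physical_cores * smt * lanes_per_register * fp_ports

# 4-core Skylake laptop: 2-way hyperthreading, 256-bit AVX2, 2 FP ports per core.
print(cpu_fp32_lanes(4, 2, 256, 2))                   # 128
print(cpu_fp32_lanes(4, 2, 256, 2, count_smt=False))  # 64
```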

Sorta, yeah!

Also, your "128 cuda cores" of Skylake variety run at higher frequencies and work off of much bigger caches, so they are faster (in serial manner)...

...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...

...until they are faster again when the shader program uses a lot of registers and GPU occupancy drops to the floor and latency hiding stops hiding that well.

But core counts - yes, more or less.
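
To make the "occupancy drops to the floor" case concrete, here is a sketch of how per-thread register usage limits resident threads. The 256 KiB register file and the 2048-thread cap are illustrative assumptions (in line with the "hundreds of KBs per GPU core" mentioned above), not specs of any particular GPU, and real hardware allocates registers at coarser granularity.

```python
REGISTER_FILE_BYTES = 256 * 1024  # assumed per-core register file size
MAX_RESIDENT_THREADS = 2048       # assumed hardware cap per core
BYTES_PER_REGISTER = 4            # one 32-bit register

def resident_threads(registers_per_thread: int) -> int:
    """Threads that fit in the register file, up to the hardware cap."""
    fit = REGISTER_FILE_BYTES // (registers_per_thread * BYTES_PER_REGISTER)
    return min(MAX_RESIDENT_THREADS, fit)

def occupancy(registers_per_thread: int) -> float:
    """Resident threads as a fraction of the hardware cap."""
    return resident_threads(registers_per_thread) / MAX_RESIDENT_THREADS

for regs in (32, 64, 128):
    print(regs, resident_threads(regs), occupancy(regs))
# 32 2048 1.0
# 64 1024 0.5
# 128 512 0.25
```

Under these assumptions, a shader using 128 registers per thread leaves only a quarter of the threads available for latency hiding.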

Thank you!

> ...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...

Is the GPU latency hiding mechanism equivalent to SMT/Hyperthreading, but with more threads per physical core? Or is there more machinery?

Also, how similar are GPU "streaming multiprocessors"/cores to CPU cores at the microarchitectural level? Are they out-of-order? Do they do register renaming?

Not quite. These floating-point EUs are shared between both threads of the physical core. I would rather say 64 CUDA cores.

Even more so, we shouldn't assume they have similar architectural layouts.

I think a better comparison is to look at floating-point performance. For example, the 10-core M2 GPU does 3.6 TFLOPS (FP32), while an RTX 4060 does 15 TFLOPS and an RTX 4090 does 82.58 TFLOPS.

Unfortunately, hardly. Ampere (Nvidia 3000 series), Ada (Nvidia 4000 series), and RDNA 3 (AMD 7000 series) GPUs have doubled up their FP32 units in ways that differ in implementation (between AMD and Nvidia) but are similarly poor in how well the extra units can be utilized at rates much higher than pre-doubling (Nvidia is doing better than AMD at that, but is still very far from great).

The formal TFLOPS comparison as a result would be most sensible between pre-M3 designs, AMD 6000 series (RDNA 2), and Nvidia's 2000 series (Turing). After that it gets really murky with AMD's "TFLOPS" looking nearly 2x more than they are actually worth by the standards of prior architectures, followed by Nvidia (some coefficient lower than 2, but still high), followed by M3 which from the looks of it is basically 1.0x on this scale, so long as we're talking FP32 TFLOPS specifically as those are formally defined.

You can see this effect the easiest by comparing perf & TFLOPS of AMD 6000 series and Nvidia 3000 series - they were released at nearly the same time, but AMD 6000 is one gen before the "near-fake-doubling", while Nvidia's 3000 series is the first gen with the "close-to-fake-doubling": with a little effort you'll find GPUs between these two that perform very similarly (and have very similar DRAM bandwidth), but Ampere's counterpart has almost 2x the FP32 TFLOPS.

Those statements have to be made carefully. A lot of the time the GPU is memory-bandwidth bound, so an increase in FLOPS does nothing. Doesn't mean they're worthless.
Even if you're not memory bandwidth bound, leveraging these 2x FLOPs on recent designs is hard, often due to issues like register bank clashes.

They are low utilization, but apparently still worth it because process node changes have made more ALUs take relatively little area. So doubling the ALU count, even with low utilization is still apparently an overall benefit (ie, there wasn't something better to spend that die space on).
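
The claim that extra FLOPS do nothing for a bandwidth-bound workload can be sketched with the textbook roofline model. This is a swapped-in simplification with only two limiters, not how any commenter measured anything, and the 10 TFLOP/s and 500 GB/s figures are invented for illustration.

```python
def attainable_tflops(peak_tflops: float, bandwidth_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: the lower of the ALU peak and what DRAM can feed."""
    bandwidth_limit = bandwidth_gb_s * 1e9 * flops_per_byte / 1e12
    return min(peak_tflops, bandwidth_limit)

# Hypothetical GPU: 10 TFLOP/s peak, 500 GB/s DRAM bandwidth.
# Bandwidth-bound kernel (2 FLOPs per byte moved): doubling the peak changes nothing.
print(attainable_tflops(10, 500, 2))    # 1.0
print(attainable_tflops(20, 500, 2))    # 1.0

# Arithmetic-heavy kernel (100 FLOPs per byte): doubling the peak doubles the result.
print(attainable_tflops(10, 500, 100))  # 10
print(attainable_tflops(20, 500, 100))  # 20
```

The model deliberately ignores limiters such as occupancy or register bank conflicts, so treat it as an upper bound rather than a prediction.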

Well if you're not memory bandwidth bound, you're compute bound by definition.
Not really, there is no such definition that you're referring to.

Perf of a GPU can be limited by any one of the thousand little things within the micro-architectural organization of the GPU in question, any on-chip path can become the bottleneck:

1. DRAM bandwidth

2. ALU counts

3. Occupancy

4. Instruction issue port counts

5. Quality of Warp scheduling (the scheduling problem)

6. Operand delivery

7. Any given cache bandwidth

8. Register file bandwidth (SRAM port counts)

9. Head of the line blocking in one of the many queues / paths in the design whatever that path is responsible for:

- sending memory request from the instruction processing pipelines to the memory hierarchy
- or sending the reply with the data payload back,
- or doing the same but with the texture filtering block (rather than memory H),
- or the path that parses GPU commands from command buffers created by the driver,
- or the path that subsequently processes those already decoded commands and performs on-chip resource allocation, warp creation / tear down, all of which need to be able to spawn the work further down fast enough to keep the rest of the design fed;

and so on and so on and so on.

By the time a high quality design is fully finished, matured, and successful enough on the market to show up on everyone's radar outside of the hardware design space, due to the commonly occurring ratios of costs of solutions for these various problems above, it usually ends up being 1, 2, or 3, but that's experimental data + statistics + survivorship bias, there is no "definition" that that's the case.

Further, what's "commonly occurring" is changing over time as designs drift into different operational areas in the phase space of operating modes as they pick out low-hanging fruits, science and experience behind the micro-architecture grows, common workloads change in nature, and new process nodes with new properties become the norm. Doubling up of F32 ALUs in Ampere is a good example of that, that was done in a way that changed the typical ratios substantially. And now M3 threw a giant wrench into incumbent statistics of relationships between (3) and the rest of the limiters, as well as between (3) and what's actionable for a GPU program developer to do to mend it.

You can be low on DRAM bandwidth util and ALU util at the same time. How would that be if there were no other limiters?

Generally, a component X of a computer system needs to be a limiter Y% of the time where Y equals the portion of the total cost of the system X is responsible for.

The principle is the easiest to apply in a "calculus of variations" manner: if doubling the key performance metric of X results in an increase of the cost of the entire system as a whole by 5%, but how often X is the limiter drops from 10% of the time to 5% of the time, doing the doubling would bring the design quite close to proper balance wrt. X.

Things that are cheap to beef up are relatively rarely a limiter in well-designed systems as a result. Things that are expensive to beef up are often the limiter. What is and isn't expensive to do depends heavily on where in the design space the current design is at and where the technology is at, all of which is changing over time.

FP32 was cheap to double up in Ampere, so they doubled it up, even though that provided only a relatively small performance improvement. But now as a result, FP32 is very rarely a limiter (in Ampere and Ada). That doesn't automatically mean that these designs are "gimped" in DRAM bandwidth or anything of the kind. Rather, the whole perception that a good GPU design just gotta be ALU limited all the time is nothing but a mistaken perception, just like "it's either ALU limited or DRAM bandwidth limited by definition" also is just untrue. See "occupancy limited" for a prime example.

While FP32 non-tensor FLOPS at least look comparable, FP16/BF16 with tensor cores (nowadays the default for any neural network, including LLMs) at 330 TFLOP/s blows away the M2.
Many of us still use our graphics chips for graphics, where FP32 is king.
Not on mobile, which is where Apple GPUs come from. FP32 is not necessary for many computations related to graphics, it is just simpler to deal with one data type.
Are you claiming that people aren't using FP32 on mobile, or are you claiming that people are using FP32 on mobile but could technically have gotten away with FP16?

If it's the latter, it's still correct to say that FP32 is king in mobile graphics.

Unless people explicitly ask for high or at least medium precision in the code, they probably will get FP16 floats on a mobile chip, which is totally fine for most fragment shader calculations.
Each Apple core (heh) has 128 FPUs so 40 cores would be akin to 5120 CUDA "cores".
Not quite, because Nvidia counts dual issue as a flat doubling of "core" (which previously you could accurately call a vector lane) count.
> So how does one translate to an equivalent in "CUDA cores" type terminology?

I don’t really think we can, even if we knew exactly what is in an M3 GPU core, which we don’t. Both architectures are very different, and different again from AMD GPUs. We have to count TFLOPS.

Nvidia "cheat" by counting approximately ALUs.