|
|
|
|
|
by xoranth
953 days ago
|
|
Thank you! > ...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time... Is the GPU latency hiding mechanism equivalent to SMT/Hyperthreading, but with more threads per physical core? Or is there more machinery? Also, how akin GPUs "stream multiprocessors"/cores are to CPUs ones at the microarchitectural level? Are they out-of-order? Do they do register renaming? |
|
A "Core" (Apple's term) / "Compute Unit" (AMD) / "Streaming Multiprocessor" (Nvidia) / "Core" (CPU world). This is the basic unit that gets replicated to build smaller/larger GPUs/CPUs
* Each "Core/CU/SM" supports 32-64 waves/simdgroups/warps (amd/apple/nvidia termology), or typically 2 threads (cpu terminology for hyperthreading). ie, this is the unit that has a program counter, and is used to find other work to do when one thread is unavailable. (this blurred on later Nvidia parts with Independent Thread Scheduling.)
* The instruction set typically has a 'vector width'. 4 for SSE/NEON, 8 for AVX, or typically 32 or 64 for GPUs (but can range from 4 to 128)
* Each Core/CU/SM can execute N vector instructions per cycle (2-4 is common in both CPUs and GPUs). For example, both Apple and Nvidia GPUs have 32-wide vectors and can execute 4 vectors of FP32 FMA/cycle. So 128 FPUs total, or 256 FMAs/cycle Each of these FPUs what Nvidia calls a "Core", which is why their core counts are so high.
In short, the terminology exchange rate is 1 "Apple GPU Core" == 128 "Nvidia GPU Cores", on equivalent GPUs.