by wyldfire
1520 days ago
I guess the point here is that GPUs don't have general-purpose tasks doing independent work, or an OS scheduler. You might end up masking part of the cost of some stalls if you are able to swap in other ready-to-run tasks. The bus latency you're describing doesn't exist when you don't need to copy to a GPU-dedicated GDDR/HBM bank. But while that problem goes away, it sounds like a new one, TLB pressure, shows up. I'm no expert, but I suspect mobile SoCs like Bionic and Snapdragon have the same concept as the M1 Ultra, with the integrated GPU sharing memory with the application cores. M1 probably inherited it from Bionic? So some of the work of porting compute software to this environment may have already started. I guess the challenge is that the bar for expected GPU performance is higher in a desktop system like the Mac Studio.
But I can say for sure that Intel iGPUs, AMD GCN, AMD RDNA, AMD CDNA, and multiple generations of NVIDIA GPUs all have hyperthread-like rescheduling of independent workgroups.
In fact, something like 8 wavefronts / warps are kept in flight at once on modern GPUs. When one wavefront / warp stalls on a memory read/write (or a PCIe read/write), GPUs universally "hyperthread out" to another one that is ready and hide the latency.
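As a rough illustration of that latency hiding, here is a toy model, not a description of any real GPU's scheduler. The function name, the "4 compute cycles then 28 stall cycles" workload, and the greedy switch-on-stall policy are all assumptions made up for the sketch.

```python
def simulate_utilization(num_warps, compute_cycles, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which some warp issues an instruction (toy model)."""
    # ready_at[w]: first cycle at which warp w may issue again
    ready_at = [0] * num_warps
    # remaining[w]: compute cycles left before warp w issues its next memory request
    remaining = [compute_cycles] * num_warps
    busy = 0
    current = 0  # warp that issued most recently
    for cycle in range(total_cycles):
        # Greedy policy (an assumption): stay on the current warp while it is
        # ready, otherwise switch to the next ready warp in round-robin order.
        for offset in range(num_warps):
            w = (current + offset) % num_warps
            if ready_at[w] <= cycle:
                busy += 1
                remaining[w] -= 1
                if remaining[w] == 0:
                    # The warp now waits on memory for stall_cycles cycles.
                    ready_at[w] = cycle + 1 + stall_cycles
                    remaining[w] = compute_cycles
                current = w
                break
        # If no warp was ready, the cycle is simply idle.
    return busy / total_cycles


if __name__ == "__main__":
    for warps in (1, 2, 4, 8):
        print(warps, simulate_utilization(warps, compute_cycles=4, stall_cycles=28))
```

With these made-up numbers, one warp keeps the execution unit busy only about an eighth of the time, while eight warps cover each other's stalls completely. Real hardware has many more moving parts (multiple issue slots, variable latencies, caches), so treat this as intuition only.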
It's "different" from how CPUs do it, but the fundamental principles are the same. (CPUs keep a redundant set of registers for each hardware thread, tracked in a register file. GPUs, on the other hand, have one set of registers, and the kernel scheduler (or whatever handles CUDA streams) carefully assigns those registers so they don't conflict between running wavefronts.)
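The register-partitioning side can be sketched as simple arithmetic. The defaults below (65,536 registers per compute unit, 32 threads per warp, a hardware cap of 64 resident warps) are illustrative assumptions in the ballpark of some NVIDIA parts, not figures from this thread, and the helper name is hypothetical. Real hardware also rounds allocations to some granularity, which this ignores.

```python
def max_resident_warps(regs_per_thread, regfile_size=65536,
                       threads_per_warp=32, hw_warp_limit=64):
    """Upper bound on warps that can be resident at once, limited by registers."""
    # Each resident warp gets its own private slice of the shared register pool.
    regs_per_warp = regs_per_thread * threads_per_warp
    # The pool is carved up without overlap, so the count is a floor division,
    # capped by whatever limit the hardware scheduler imposes.
    return min(hw_warp_limit, regfile_size // regs_per_warp)


if __name__ == "__main__":
    for regs in (32, 64, 128, 255):
        print(regs, max_resident_warps(regs))
```

The point is the trade-off: the more registers each thread uses, the fewer wavefronts can sit ready to be swapped in, and the less latency the GPU can hide.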
-------
The statement as written is blatantly false, at least for Intel, AMD, and NVIDIA GPUs. Maybe Apple iGPUs are built differently, but I find that unlikely.