Hacker News
by loser777 1520 days ago
"Unlike a CPU, when a GPU is waiting for data, it can't just switch to work on something else." https://twitter.com/VadimYuryev/status/1514295693581586434 Is this meant to be qualified with (Apple) GPU? Otherwise it sounds like the opposite of latency hiding, which has been the norm on desktop GPUs (and one of the first things taught to newcomers to the OpenCL/CUDA programming model) for a while.
2 comments

I guess the point here is that GPUs don't have general purpose tasks doing independent work and an OS scheduler. You might end up masking part of the cost of some stalls if you are able to swap in other ready-to-run tasks. The bus latency you're describing doesn't exist when you don't need to copy to a GPU-dedicated GDDR/HBM bank. But while that problem goes away, it sounds like this new one of TLB pressure shows up.

I'm no expert, but I suspect mobile SoCs like Bionic and Snapdragon have the same concept as the M1 Ultra, with an integrated GPU sharing memory with the application cores. M1 probably inherited it from Bionic? So some of the work of porting compute software to this environment may have already started. I guess the challenge is that the bar for GPU performance is higher in a desktop system like the Mac Studio.

I don't know much about the M1.

But I can say for sure that Intel iGPUs, AMD GCN, AMD RDNA, AMD CDNA, and multiple NVidia-generations of GPUs all have hyperthread-like rescheduling of independent workgroups.

In fact, something like 8x wavefronts / warps run in parallel on modern GPUs. When one wavefront / warp stalls due to a memory read/write (or a PCIe read/write), the GPUs universally "hyperthread-out" and hide the latency.
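
To make the latency-hiding point concrete, here is a toy model in Python. It is only a sketch under loud assumptions: one issue slot per cycle, every instruction followed by a fixed memory stall, and a scheduler that picks the first ready warp. The function name and the numbers are made up for illustration and do not describe any real GPU.

```python
def utilization(num_warps, mem_latency, cycles=10_000):
    """Fraction of cycles in which some warp issues an instruction.

    Toy assumption: each issue takes 1 cycle and is followed by a
    stall of `mem_latency` cycles before that warp is ready again.
    """
    ready_at = [0] * num_warps  # cycle at which each warp may issue again
    busy = 0
    for cycle in range(cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                # Issue from this warp, then park it until its data arrives.
                ready_at[w] = cycle + 1 + mem_latency
                busy += 1
                break  # only one issue slot per cycle
    return busy / cycles


for warps in (1, 4, 8):
    print(warps, utilization(warps, mem_latency=7))
```

In this model a single warp keeps the issue slot busy one cycle in eight, while eight resident warps cover the whole stall, which is the "hyperthread-out" effect described above.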

It's "different" from how CPUs do it, but the fundamental principles are the same. (CPUs have a redundant set of registers tracked in a register file. GPUs, on the other hand, have a set of registers, and the kernel scheduler (or whatever handles CUDA streams) carefully assigns those registers so that they do not conflict with any running wavefronts.)

-------

The statement so listed is blatantly false, at least for Intel, AMD, and NVidia GPUs. Maybe Apple iGPUs are built different, but I find that unlikely.

> The statement so listed is blatantly false, at least for Intel, AMD, and NVidia GPUs. Maybe Apple iGPUs are built different, but I find that unlikely.

The statement is for Apple GPUs only, that’s the whole point. Software can be easily ported to Metal (in a weekend according to Roblox devs) but until it’s optimised for TBDR it will underperform.

> You might end up masking part of the cost of some stalls if you are able to swap in other ready-to-run tasks.

You'd need SMT to do this for memory stalls, and Apple M1 doesn't use SMT - it has the same number of logical cores (hardware threads) as physical cores.

Source? Every unified programmable GPU I've seen uses SMT, including the PowerVR GPUs going back to the SGX days. It's core to how they approach modern memory hierarchies.
Looking into it more, AGX2 (like pretty much every fairly high perf modern GPU) is heavily SMT, allowing up to 1024 simultaneous threads per core depending on how many registers each shader invocation needs.

https://rosenzweig.io/blog/asahi-gpu-part-3.html

He says the entire thread group needs to be stalled. The thread group is (on M1) 32 SIMD lanes that have to execute basically the same control flow. Presumably other thread groups can continue executing.
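
A rough sketch of how the "up to 1024 threads per core depending on registers" limit above could work out, in Python. The 1024 ceiling and the 32-lane group size come from the comments above; the register-file size is a hypothetical placeholder, not an Apple figure, and rounding down to whole thread groups is also an assumption.

```python
SIMD_WIDTH = 32        # lanes per thread group on M1, per the comment above
MAX_THREADS = 1024     # per-core ceiling quoted above
REGFILE_REGS = 65536   # HYPOTHETICAL register-file size, for illustration only


def resident_threads(regs_per_thread):
    """Threads that fit on one core for a given per-thread register count."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_regs = REGFILE_REGS // regs_per_thread  # limit from register budget
    capped = min(MAX_THREADS, by_regs)         # limit from the hardware ceiling
    return capped - capped % SIMD_WIDTH        # assume whole thread groups only


for regs in (32, 100, 128):
    threads = resident_threads(regs)
    print(regs, threads, threads // SIMD_WIDTH)
```

The fewer registers a shader needs, the more thread groups stay resident, and the more candidates the scheduler has to switch to when one group stalls on memory.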