Hacker News new | ask | show | jobs
by Lichtso 1431 days ago
> GPUs are getting better at handling thread divergence in the microarch

That is an interesting point, how does that work (especially with the dynamics of ray tracing)? Do they recombine under utilized wavefronts or something?

3 comments

I'm not aware of anything that improves thread-divergence. NVidia's most recent GPUs have superscalar operations, which is a trick from CPU-land (multiple pipelines operating 2 or more instructions per clock tick). NVidia has an integer-pipeline and a floating-point pipeline, and both can operate simultaneously (ex: for(int i=0; i<100; i++) x *=blah; the "i++" is integer, while the "x *= blah" is floating point, so both operate simultaneously.

CPUs have extremely flexible pipelines: Intel's pipeline 0 and 1 basically can do anything, pipeline 5 can do most stuff but is missing division IIRC (and a few other things). Load/store are done on some other pipelines, etc. etc.

Apple's and AMD's CPU pipelines are more symmetrical and uniform.

NVidia GPUs are the only superscalar ones I can think of, aside from AMD GPU's scalar vs vector split (which isn't really the "superscalar" operation I'm trying to describe).

Starting with Volta, Nvidia GPUs have forward progress guarantee, preventing lockups when there’s thread divergence.

That doesn’t improve the performance of a well behaved and well written compute shader. But avoiding hard hangs IMO deserves the label “improved thread divergence.”

Aren't warps still 32 threads, even though number of threads is skyrocketing, effectively making them proportionately finer granularity? Are things different in AMD land?
Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.

Bigger differences are the instruction counter per thread since volta on nvidia (which I think is a terrible feature) and that forward progress guarantees are stronger on nvidia (those are _really_ helpful but expensive).

Nvidia GPUs were 32 threads per warps eight from the start of CUDA with the 8800 GTX.

> which I think is a terrible feature <> those are _really_ helpful but expensive

Guaranteed forward progress is a direct consequence of having an instruction counter per thread???

Or so I thought. How else would an SM be able to know the PC of a group of threads that wasn’t stuck?

> Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.

AMD GCN was 64 threads/wavefront. NVidia always was 32 threads/warp.

AMD's newest consumer cards RDNA and RDNA2 are 32 threads/wavefront. However, GCN lives on with CDNA (MI200 supercomputer chips), with 64 threads/wavefront architecture.

There is tech in late model GPUs to keep all same divergent threads in the same warp/wavefront.