Hacker News new | ask | show | jobs
by oddity 1431 days ago
The difference is much more nuanced than this. A modern GPU can (and probably does) do most of what you've listed for a CPU. Speculative execution and branch prediction are a bit less likely to be invested in (because they don't need it as much due to oversubscription), but that's increasingly true for CPUs as well for high-efficiency cores. The difference (at a category vs category level and not specific microarch) is mostly a matter of tuning for particular workloads. I'm increasingly souring on SIMD/SIMT being a useful distinction now that bleeding-edge CPUs are widening in the microarch and bleeding-edge GPUs are getting better at handling thread divergence in the microarch. There is a difference, certainly, but it's difficult to describe in a few bullet points.

GPUs are more likely to have more exotic features than you'll see on a CPU to deal with things like thread coordination and cache coherence, but there's nothing fundamentally stopping CPUs from adding that (or wanting that) as well.

1 comments

> GPUs are getting better at handling thread divergence in the microarch

That is an interesting point, how does that work (especially with the dynamics of ray tracing)? Do they recombine under utilized wavefronts or something?

I'm not aware of anything that improves thread-divergence. NVidia's most recent GPUs have superscalar operations, which is a trick from CPU-land (multiple pipelines operating 2 or more instructions per clock tick). NVidia has an integer-pipeline and a floating-point pipeline, and both can operate simultaneously (ex: for(int i=0; i<100; i++) x *=blah; the "i++" is integer, while the "x *= blah" is floating point, so both operate simultaneously.

CPUs have extremely flexible pipelines: Intel's pipeline 0 and 1 basically can do anything, pipeline 5 can do most stuff but is missing division IIRC (and a few other things). Load/store are done on some other pipelines, etc. etc.

Apple's and AMD's CPU pipelines are more symmetrical and uniform.

NVidia GPUs are the only superscalar ones I can think of, aside from AMD GPU's scalar vs vector split (which isn't really the "superscalar" operation I'm trying to describe).

Starting with Volta, Nvidia GPUs have forward progress guarantee, preventing lockups when there’s thread divergence.

That doesn’t improve the performance of a well behaved and well written compute shader. But avoiding hard hangs IMO deserves the label “improved thread divergence.”

Aren't warps still 32 threads, even though number of threads is skyrocketing, effectively making them proportionately finer granularity? Are things different in AMD land?
Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.

Bigger differences are the instruction counter per thread since volta on nvidia (which I think is a terrible feature) and that forward progress guarantees are stronger on nvidia (those are _really_ helpful but expensive).

Nvidia GPUs were 32 threads per warps eight from the start of CUDA with the 8800 GTX.

> which I think is a terrible feature <> those are _really_ helpful but expensive

Guaranteed forward progress is a direct consequence of having an instruction counter per thread???

Or so I thought. How else would an SM be able to know the PC of a group of threads that wasn’t stuck?

> Slightly, the older tech is 64 threads/lanes per warp/wavefront. Newer ones are 32 by default but 64 if desired.

AMD GCN was 64 threads/wavefront. NVidia always was 32 threads/warp.

AMD's newest consumer cards RDNA and RDNA2 are 32 threads/wavefront. However, GCN lives on with CDNA (MI200 supercomputer chips), with 64 threads/wavefront architecture.

There is tech in late model GPUs to keep all same divergent threads in the same warp/wavefront.