by 6keZbCECT2uB 1101 days ago
I'm not sure about the 4090, but most of the GPUs I use have a warp size of 32, and warp divergence only affects the threads within a single warp, so at most those 32. If you have a branch and all threads in the warp agree, you only walk down one branch.
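
To make the divergence point concrete, here's a toy model in Python (not real GPU code). It assumes a warp size of 32 and a single if/else; `branch_passes_per_warp` is a made-up helper name. It just counts how many sides of the branch each warp would have to walk.

```python
WARP_SIZE = 32  # assumption: the warp size mentioned above; check your GPU


def branch_passes_per_warp(takes_branch, warp_size=WARP_SIZE):
    """For each warp, return how many sides of an if/else it has to execute.

    takes_branch is a list of booleans, one per thread, in thread-index order.
    """
    passes = []
    for start in range(0, len(takes_branch), warp_size):
        warp = takes_branch[start:start + warp_size]
        # A warp runs one side if every thread agrees, both sides otherwise.
        passes.append(len(set(warp)))
    return passes


# Threads 0..63: the condition is uniform within each warp.
uniform = [tid < 32 for tid in range(64)]
# Threads 0..63: even and odd threads disagree inside every warp.
divergent = [tid % 2 == 0 for tid in range(64)]

print(branch_passes_per_warp(uniform))    # [1, 1]
print(branch_passes_per_warp(divergent))  # [2, 2]
```

Both inputs send half of the 64 threads down each side. Only the second one pays for it, because the disagreement happens inside a warp.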

My mental model is a bit more like you have collections of warps in a block, and all warps in a block get scheduled onto an SM. Different GPU architectures allow for different numbers of warps to be simultaneously active or inactive, and each warp has its own instruction pointer and can be suspended while waiting for things like memory. I found the picture on pg 22 here really helpful: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-cente...
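
A rough sketch of that mental model, again as a Python toy and not how any real scheduler works: one scheduler, one issue per cycle, each warp with its own instruction pointer, and a warp that issues a memory op is suspended for a fixed number of cycles. `run_sm`, the `"alu"`/`"mem"` op names, and the latency of 4 are all invented for illustration.

```python
def run_sm(num_warps, program, mem_latency=4):
    """Return cycles taken for one scheduler to finish `program` on every warp.

    program is a list of "alu" / "mem" ops. Each cycle the scheduler issues
    one instruction from the first warp that is not waiting on memory.
    """
    pc = [0] * num_warps          # per-warp instruction pointer
    ready_at = [0] * num_warps    # cycle at which each warp may issue again
    cycle = 0
    while any(p < len(program) for p in pc):
        for w in range(num_warps):
            if pc[w] < len(program) and ready_at[w] <= cycle:
                if program[pc[w]] == "mem":
                    # Suspend this warp until the load comes back.
                    ready_at[w] = cycle + mem_latency
                pc[w] += 1
                break  # one issue per cycle
        cycle += 1
    return cycle


program = ["mem", "alu"]
print(run_sm(1, program))  # 5
print(run_sm(4, program))  # 8
```

With one warp, 2 instructions take 5 cycles because the scheduler sits idle during the load. With four warps, 8 instructions take 8 cycles, since there is always another warp ready to issue while the others wait.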

Note that although there are 4 schedulers per SM on the A100, they don't dispatch every cycle, iirc.


Those are tensor cores, not CUDA cores. They're used for AI rather than general compute/shaders. The 4090 has 512 of those.

Correct me if I'm wrong, but as far as I can tell tensor cores are just accelerators. They can't do general compute: no branch or jump.

The tensor core mostly accelerates matrix operations. It's the big block in the diagram, and there are 4 of them per SM. A CUDA core is the per-thread execution unit, which shows up in the diagram as the FP32 or INT32 units, so there are (32*4) per SM on that diagram.
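
Spelling out the arithmetic, using only the numbers quoted in this thread. The constant names are made up, and applying the diagram's 4-tensor-cores-per-SM figure to the 4090 is an assumption.

```python
PARTITIONS_PER_SM = 4        # the four blocks per SM in the diagram
UNITS_PER_PARTITION = 32     # FP32/INT32 units per block, as counted above
TENSOR_CORES_PER_SM = 4      # one per block
TENSOR_CORES_4090 = 512      # figure quoted earlier in the thread

units_per_sm = UNITS_PER_PARTITION * PARTITIONS_PER_SM
# Assumption: the 4090 also has 4 tensor cores per SM, like the diagram.
implied_sms_4090 = TENSOR_CORES_4090 // TENSOR_CORES_PER_SM

print(units_per_sm)        # 128
print(implied_sms_4090)    # 128
```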

Like you said, a tensor core is similar to a special-purpose ALU and sits at a lower level of abstraction than something with an instruction pointer.
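
For what "special-purpose ALU" means here, this is a plain Python sketch of the kind of operation a tensor core performs: a fused matrix multiply-accumulate, D = A*B + C, on a small fixed-size tile. The `mma` name and the 2x2 tile are illustrative only; real tile shapes and data types depend on the architecture.

```python
def mma(a, b, c):
    """Return a @ b + c for square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


identity = [[1, 0], [0, 1]]
a = [[1, 2], [3, 4]]
c = [[10, 10], [10, 10]]
print(mma(a, identity, c))  # [[11, 12], [13, 14]]
```

The sequence of multiplies and adds is the same whatever the input values are, so nothing in it needs a branch or a jump. Any control flow lives in the warp that issues the operation.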