by 6keZbCECT2uB
1101 days ago
I'm not sure about the 4090, but most of the GPUs I use have a warp size of 32, and warp divergence only affects threads within a single warp, so at most those 32 threads. If a warp hits a branch and all of its threads agree, it only walks down one side. My mental model is more like this: a block is a collection of warps, and all the warps in a block get scheduled onto the same SM. Different GPU architectures allow different numbers of warps to be resident at once, whether active or stalled, and each warp has its own instruction pointer and can be suspended while it waits on things like memory. I found the picture on pg 22 here really helpful: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-cente... Note that although there are 4 schedulers on the A100, they don't dispatch every cycle iirc.

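To make the divergence point concrete, here is a toy model in plain Python rather than real GPU code. It assumes a warp size of 32 and treats a divergent warp as executing both sides of the branch one after the other. The names `branch_passes` and `WARP_SIZE` are made up for this sketch, and it ignores everything else a real scheduler does.

```python
WARP_SIZE = 32  # assumption: the warp size mentioned above


def branch_passes(predicates, warp_size=WARP_SIZE):
    """Return, per warp, how many sides of a branch get executed.

    predicates[i] is the branch condition as seen by thread i of a block.
    A warp whose threads all agree runs one side (1 pass). A warp with
    mixed outcomes runs both sides (2 passes), with the threads that did
    not take a given side masked off.
    """
    passes = []
    for start in range(0, len(predicates), warp_size):
        warp = predicates[start:start + warp_size]
        # Distinct outcomes within this warp: 1 if uniform, 2 if divergent.
        passes.append(len({bool(p) for p in warp}))
    return passes


# A block of 128 threads, i.e. 4 warps.
# Condition depends on the warp index: every warp is internally uniform.
uniform_per_warp = [(tid // WARP_SIZE) % 2 == 0 for tid in range(128)]
# Condition alternates per thread: every warp diverges.
alternating = [tid % 2 == 0 for tid in range(128)]

print(branch_passes(uniform_per_warp))
print(branch_passes(alternating))
```

In this model, a condition that is uniform inside each warp costs one pass per warp even if different warps take different sides, while a condition that alternates thread by thread costs two passes in every warp. Divergence in one warp never changes the count for another.
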
Correct me if I'm wrong, but as far as I can tell tensor cores are just accelerators. They can't do general compute: they have no branch or jump instructions.