Hacker News
by MaxBarraclough 260 days ago
I'm not sure I follow. Matrix multiplication isn't inherently 'branchy' in a way that we would expect to cause inefficient execution on SIMT (i.e. branch divergence).
1 comment

I think the remark is more about the fact that Tensor Cores (Matrix Cores in AMD lingo) are distributed per SM, rather than sitting off to the side on an interconnect and being individually programmable. So on the same SM you have your classical warps (CUDA cores) AND the Tensor units, and switching between one and the other might be confusing.

My mental model of SMs has always been "assume AVX512 is the default ISA" plus "tensor cores are another layer alongside it" (kind of like AMX), so you end up with this heterogeneous "thing" to program. Don't know if it helps. The CUDA programming model hides a lot, and looking at the PTX code in Nsight Compute is most enlightening.
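To make the "two kinds of units on one SM" point concrete, here is a minimal sketch using the WMMA API from `<mma.h>`. The kernel and helper names (`tile_mma_relu`, `run_tile_demo`) and the test data are made up for illustration, and it assumes a Volta-or-later GPU (sm_70+). A single warp issues one 16x16x16 tensor-core multiply, then the same warp's threads apply a ReLU with ordinary SIMT instructions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: one warp computes C = relu(A * B) for a 16x16 tile.
__global__ void tile_mma_relu(const half* a, const half* b, float* c) {
    // Fragments are warp-wide objects; each thread holds a slice of the tile.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16 elements
    wmma::load_matrix_sync(b_frag, b, 16);

    // Tensor-core part: the whole warp cooperates on one matrix op.
    wmma::mma_sync(acc, a_frag, b_frag, acc);

    // "Classical" SIMT part: each thread touches only its own elements.
    // The element-to-thread mapping is unspecified, which is fine for a
    // uniform elementwise op like ReLU.
    for (int i = 0; i < acc.num_elements; ++i) {
        acc.x[i] = fmaxf(acc.x[i], 0.0f);
    }

    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}

// Hypothetical host helper: A is all ones; B is +1 in even columns and
// -1 in odd columns, so ReLU clamps every odd output column to zero.
std::vector<float> run_tile_demo() {
    const int N = 16;
    std::vector<half> ha(N * N), hb(N * N);
    for (int i = 0; i < N * N; ++i) ha[i] = __float2half(1.0f);
    for (int n = 0; n < N; ++n)
        for (int k = 0; k < N; ++k)   // B is stored column-major
            hb[n * N + k] = __float2half(n % 2 == 0 ? 1.0f : -1.0f);

    half *da, *db;
    float *dc;
    cudaMalloc(&da, N * N * sizeof(half));
    cudaMalloc(&db, N * N * sizeof(half));
    cudaMalloc(&dc, N * N * sizeof(float));
    cudaMemcpy(da, ha.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);

    tile_mma_relu<<<1, 32>>>(da, db, dc);   // exactly one warp

    std::vector<float> hc(N * N);
    cudaMemcpy(hc.data(), dc, N * N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return hc;
}

int main() {
    std::vector<float> c = run_tile_demo();
    printf("C[0][0] = %f, C[0][1] = %f\n", c[0], c[1]);
    return 0;
}
```

Something like `nvcc -arch=sm_70 demo.cu` should build it. Dumping the PTX (`nvcc -ptx`) should show the `wmma` instructions sitting right next to the ordinary scalar code for the ReLU loop, which is the heterogeneous mix described above.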