I think the remark is more that Tensor Cores (or Matrix Cores in AMD lingo) are distributed per SM (rather than sitting off to the side on an interconnect and being individually programmable), so on the same SM you have your classical warps (CUDA cores) AND the Tensor units, and switching between one and the other can be confusing.

My mental model of SMs has always been "assume AVX512 is the default ISA" and "tensor cores are another layer sitting alongside it" (kind of like AMX), so you end up with this heterogeneous "thing" to program. Don't know if it helps. The CUDA programming model hides a lot, and looking at the PTX code in Nsight Compute is most enlightening.
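To make the "two layers on one SM" picture concrete, here is a minimal sketch (not anything from the thread) of a single warp that first drives the tensor units through the `nvcuda::wmma` API and then falls back to ordinary per-thread SIMT code on the same data. The kernel name `tile_matmul_then_scale`, the 16x16x16 tile shape, and the all-ones inputs are my own choices for illustration; it assumes a GPU with tensor cores (sm_70 or newer) and a build along the lines of `nvcc -arch=sm_70`.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: one warp computes a 16x16 tile C = A * B on the
// tensor units, then scales it with plain per-thread code.
__global__ void tile_matmul_then_scale(const half* a, const half* b,
                                       float* c, float alpha) {
    // Tensor-unit layer: fragments are owned by the warp as a whole,
    // not by individual threads.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);

    // Make the tile written by the whole warp visible to each lane.
    __syncwarp();

    // "Classical" layer: each of the 32 threads handles 8 of the 256 elements.
    for (int i = threadIdx.x; i < 16 * 16; i += 32) {
        c[i] *= alpha;
    }
}

int main() {
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, 16 * 16 * sizeof(half));
    cudaMallocManaged(&b, 16 * 16 * sizeof(half));
    cudaMallocManaged(&c, 16 * 16 * sizeof(float));

    // All-ones inputs, so every entry of A * B is a sum of 16 ones.
    for (int i = 0; i < 16 * 16; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
        c[i] = 0.0f;
    }

    // Exactly one warp: the wmma calls must be executed by all 32 lanes.
    tile_matmul_then_scale<<<1, 32>>>(a, b, c, 0.5f);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("c[0] = %f\n", c[0]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

If you dump the PTX for this (for example with `nvcc -arch=sm_70 -ptx`), the first half of the kernel should show up as warp-wide `wmma.*` instructions while the scaling loop is ordinary scalar loads, multiplies, and stores, which is pretty much the heterogeneity described above.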