by 6keZbCECT2uB 1091 days ago
The tensor core mostly accelerates matrix operations. It is the big block in the diagram, and there are 4 of them per SM. "CUDA core" refers to the per-thread execution units in an SM, which show up as the FP32 or INT32 units, so there are 32*4 = 128 per SM on that diagram.

Like you said, a tensor core is similar to a special-purpose ALU, and it sits at a lower level of abstraction than something with its own instruction pointer.
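
To make the "special-purpose ALU" comparison concrete, here is a rough Python model of the two kinds of unit. It is only a sketch: the 4x4 tile size and the names `scalar_fma`, `tile_mma`, and `fma_count` are made up for illustration, and real tile shapes and precisions differ between GPU generations. The 32 and 4 are just the counts from the diagram discussed above.

```python
# Toy model only. TILE and all function names are illustrative assumptions,
# not a description of any specific GPU's hardware.
TILE = 4                 # hypothetical tile edge length
LANES_PER_BLOCK = 32     # FP32 units per block, as counted on the diagram
BLOCKS_PER_SM = 4        # blocks (and tensor cores) per SM on the diagram

CUDA_CORES_PER_SM = LANES_PER_BLOCK * BLOCKS_PER_SM


def scalar_fma(a, b, c):
    """One FP32 lane ("CUDA core"): a single multiply-add per operation."""
    return a * b + c


def tile_mma(A, B, C):
    """Tensor-core-style unit: D = A @ B + C over a whole tile as one operation.

    The loops spell out the arithmetic; the point is that the unit takes
    matrix operands and has no control flow or instruction pointer of its own.
    """
    n = len(A)
    D = [row[:] for row in C]  # start from the accumulator tile
    for i in range(n):
        for j in range(n):
            for k in range(n):
                D[i][j] = scalar_fma(A[i][k], B[k][j], D[i][j])
    return D


def fma_count(n):
    """Scalar multiply-adds folded into one n x n x n tile operation."""
    return n * n * n


identity = [[1.0 if i == j else 0.0 for j in range(TILE)] for i in range(TILE)]
ones = [[1.0] * TILE for _ in range(TILE)]
D = tile_mma(identity, ones, ones)  # identity @ ones + ones
```

In this model a single tile operation stands in for `fma_count(TILE)` scalar multiply-adds, which is the sense in which the tensor core is a wide, fixed-function ALU and not a separate processor.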