by magic_at_enimai
1580 days ago
So with Tensor Cores you use TF32, which is more like FP19-ish, and the marketing makes you think you get 8x the performance. But if you want actual FP32 precision you will need something like [1], and then your performance in the Tensor Core path is _only_ 2x faster than the SIMT path. I'll leave the prefix sum for other devs who know more :D

[1] https://github.com/NVIDIA/cutlass/blob/master/examples/27_am...

//part of nod.ai/shark team
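To make the "FP19-ish" point concrete: TF32 keeps FP32's sign bit and 8 exponent bits but only 10 of its 23 mantissa bits, so 19 bits in total. Below is a rough sketch in plain Python that emulates this by truncating the mantissa, then compares a single TF32 product against a split-operand scheme that uses three TF32 products. The split scheme is my assumption about the general kind of trick meant by "something like [1]", not a description of that example. The helper names (`to_tf32`, `split_tf32`, `mul_1x`, `mul_3x`) are made up for illustration, real hardware may round rather than truncate, and the accumulation here happens in Python's FP64 rather than FP32.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_tf32(x: float) -> float:
    """Emulate TF32 by clearing the low 13 of FP32's 23 mantissa bits.

    Assumption: plain truncation; actual hardware rounding may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split_tf32(x: float):
    """Split an FP32 value into a TF32 'big' part and a TF32 'small' residual."""
    big = to_tf32(x)
    small = to_tf32(to_f32(x - big))
    return big, small

def mul_1x(a: float, b: float) -> float:
    """One TF32 product: both operands lose their low mantissa bits."""
    return to_tf32(a) * to_tf32(b)

def mul_3x(a: float, b: float) -> float:
    """Three TF32 products; the small*small term is dropped as negligible."""
    a_big, a_small = split_tf32(a)
    b_big, b_small = split_tf32(b)
    return a_big * b_big + a_big * b_small + a_small * b_big

a = to_f32(1.2345678)
b = to_f32(7.6543210)
exact = a * b  # FP64 product of the two FP32 inputs, used as the reference

err_1x = abs(mul_1x(a, b) - exact) / abs(exact)
err_3x = abs(mul_3x(a, b) - exact) / abs(exact)
print(f"relative error, 1 TF32 product:  {err_1x:.3e}")
print(f"relative error, 3 TF32 products: {err_3x:.3e}")
```

The three-product version costs roughly three times the Tensor Core work per multiply, which is one plausible reason the advertised speedup shrinks so much once you insist on FP32-level accuracy.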
Reading between the lines a little, it sounds like your infrastructure is potentially able to exploit a good deal of the available throughput for FP32 workloads. That's great, and I'm happy to see it! However, for workloads that don't need that much precision, the tradeoff might be a lot less advantageous to M1. That may change again if and when Apple opens up lower-level APIs to their hardware, or reverse engineering delivers usable results.