I think we're talking past each other to some extent. Putting aside the question of how misleading it is to market a multiply with an FP16-width (10-bit) mantissa as a "TF32" operation, this is all about tradeoffs. The specific tradeoff that these tensor cores make is that in exchange for reduced precision (and a programming model which is even more of a pain than ordinary compute shaders, an astonishing achievement in and of itself), you get a lot more throughput. For certain AI workloads, particularly inference, that tradeoff is well worth it.
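To make the precision side of that tradeoff concrete, here's a rough sketch in Python of what dropping to a TF32-style input does to a value. TF32 keeps float32's 8 exponent bits but only 10 explicit mantissa bits, the same mantissa width as FP16. The function name `to_tf32_like` is made up for illustration, and I'm assuming plain truncation of the low bits; the rounding the hardware actually applies may differ, so treat this as an approximation and not a model of the tensor cores.

```python
import struct

def to_tf32_like(x: float) -> float:
    """Emulate TF32-style input precision (hypothetical helper, not a vendor API).

    The value is first rounded to float32, then the low 13 of its 23 mantissa
    bits are cleared, leaving 10. Truncation is an assumption made for
    simplicity; NaN and infinity are not treated specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

if __name__ == "__main__":
    for value in (1.0 + 2**-10, 1.0 + 2**-11, 1.0 / 3.0):
        reduced = to_tf32_like(value)
        rel_err = abs(reduced - value) / value
        print(f"{value!r} -> {reduced!r} (relative error {rel_err:.2e})")
```

Under those assumptions the relative error per input is on the order of 2^-10, about three decimal digits, versus roughly 2^-24 for ordinary FP32. Whether that's acceptable is exactly the workload-dependent question.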

Reading between the lines a little, it sounds like your infrastructure is potentially able to exploit a good deal of the available throughput for FP32 workloads. That's great, and I'm happy to see it! However, for workloads that don't need that much precision, the comparison might be a lot less favorable to M1. That may change again if and when Apple opens up lower-level APIs to their hardware, or reverse engineering delivers usable results.