by magic_at_enimai 1571 days ago
Here are the matmul sizes for the MiniLM model used for inference: https://github.com/mmperf/mmperf/blob/main/benchmark_sizes/b...

These are the matmul sizes for the BERT training workload https://github.com/mmperf/mmperf/blob/main/benchmark_sizes/b...

Yes, we use the latest MoltenVK (1.3.204.0) installed on the system.

I will let @noxa and other IREE devs chime in on the SPIR-V path, but we do support prefix sums etc. in the GPU path.

//part of nod.ai team.


Thanks for the matmul sizes, but the question I am more interested in is precision. Matrix multiply throughput can be dramatically higher on tensor cores[1] than on the normal shader ALUs, specifically for reduced-precision arithmetic. I'm wondering to what extent that kind of hardware is accessible on the M1, and to what extent IREE can target it.

Regarding prefix sum, the specific question I'm interested in is that a SPIR-V OpControlBarrier with device scope gets translated into threadgroup_barrier(mem_device) [2]. That's insufficient to make decoupled look-back work, because the algorithm needs writes from one threadgroup to become visible to other threadgroups. Conversely, if you're not using decoupled look-back, you're not getting the full throughput on GPUs that do support that barrier. I'm wondering how your infrastructure deals with that.

[1]: https://developer.nvidia.com/blog/programming-tensor-cores-c...

[2]: https://github.com/linebender/piet-gpu/blob/d81e5cb4ee145abd...
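
For anyone who hasn't met decoupled look-back, here is a rough CPU-side sketch of the control flow in Python. It is not GPU code. Each partition publishes a (flag, value) pair and then walks backwards over its predecessors, spinning on any that haven't published yet. All names (`decoupled_lookback_scan`, the flag constants) are made up for illustration. Threads stand in for workgroups, and the single tuple assignment stands in for the cross-workgroup visibility that a device-scope barrier is supposed to provide on a real GPU.

```python
import threading
import time
from itertools import accumulate

# Hypothetical per-partition status flags for this sketch.
NOT_READY, AGGREGATE, PREFIX = 0, 1, 2


def decoupled_lookback_scan(data, part_size):
    """Inclusive prefix sum; one thread per partition, simulating workgroups."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    # One (flag, value) slot per partition. Replacing the whole tuple in a
    # single assignment publishes the flag and the value together.
    status = [(NOT_READY, 0)] * len(parts)
    out = [None] * len(parts)

    def worker(p):
        local = list(accumulate(parts[p]))  # scan within the partition
        aggregate = local[-1]
        exclusive = 0                       # sum of everything before partition p
        if p == 0:
            status[0] = (PREFIX, aggregate)
        else:
            status[p] = (AGGREGATE, aggregate)
            q = p - 1
            while q >= 0:
                flag, value = status[q]
                if flag == NOT_READY:
                    time.sleep(0)           # spin until the predecessor publishes
                    continue
                exclusive += value
                if flag == PREFIX:          # predecessor already knows its full prefix
                    break
                q -= 1                      # only an aggregate: keep looking back
            status[p] = (PREFIX, exclusive + aggregate)
        out[p] = [exclusive + v for v in local]

    # Start in reverse order to show that launch order does not matter here.
    threads = [threading.Thread(target=worker, args=(p,))
               for p in reversed(range(len(parts)))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [v for chunk in out for v in chunk]
```

The spin on `NOT_READY` is the part that depends on another partition's write actually becoming visible, which is where the barrier translation above matters.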

So with Tensor Cores you use TF32, which is more like FP19-ish, and the marketing makes you think you get 8x the performance. But if you want actual FP32 precision you will need something like [1], and then your performance in the Tensor Core path is _only_ 2x faster than the SIMT path.
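
To make the "FP19-ish" point concrete, here is a small Python sketch. TF32 keeps FP32's sign bit and 8 exponent bits but only 10 mantissa bits, so 19 bits carry information. Assumptions to flag: the helper names are hypothetical, the conversion truncates rather than modelling the hardware's rounding, and the split-operand product only illustrates the general idea of recovering precision with several TF32 multiplies. It is not a transcription of the linked example.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to IEEE FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """FP32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def to_tf32(x: float) -> float:
    """Keep sign, 8 exponent bits and the top 10 mantissa bits.

    Simplification: the low 13 mantissa bits are truncated, not rounded.
    """
    return bits_f32(f32_bits(x) & 0xFFFFE000)


def split_tf32(x: float):
    """Split x into a TF32 'big' part and a TF32 'small' residual."""
    big = to_tf32(x)
    small = to_tf32(x - big)
    return big, small


def mul_1x(a: float, b: float) -> float:
    """Single TF32 multiply: both operands lose their low mantissa bits."""
    return to_tf32(a) * to_tf32(b)


def mul_3x(a: float, b: float) -> float:
    """Three TF32 multiplies; the small*small term is dropped."""
    a_big, a_small = split_tf32(a)
    b_big, b_small = split_tf32(b)
    return a_big * b_big + a_big * b_small + a_small * b_big


a = b = 1.0 + 2.0 ** -12          # exactly representable in FP32, not in TF32
exact = a * b
print(abs(mul_1x(a, b) - exact))  # error of the single-multiply path
print(abs(mul_3x(a, b) - exact))  # error of the three-multiply path
```

The three-multiply version costs roughly three times the tensor core work, which is one way to see why the speedup over the SIMT path shrinks once you ask for FP32-level accuracy.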

I'll leave the prefix sum for other devs who know more :D

[1]: https://github.com/NVIDIA/cutlass/blob/master/examples/27_am...

//part of nod.ai/shark team

I think we're talking past each other to some extent. Putting aside the question of how misleading it is to market a 16-bit multiply as a "TF32" operation, this is all about tradeoffs. The specific tradeoff that these tensor cores make is that in exchange for reduced precision (and a programming model which is even more of a pain than ordinary compute shaders, an astonishing achievement in and of itself), you get a lot more throughput. For certain AI workloads, particularly inference, that tradeoff is well worth it.

Reading between the lines a little, it sounds like your infrastructure is potentially able to exploit a good deal of the available throughput for FP32 workloads. That's great, and I'm happy to see it! However, for workloads that don't need that much precision, the tradeoff might be a lot less advantageous to M1. That may change again if and when Apple opens up lower-level APIs to their hardware, or reverse engineering delivers usable results.

TF32 and FP16 tensor core modes are completely different, and TF32 is not a 16-bit multiplication.