by raphlinus
1577 days ago
Thanks for the matmul sizes, but the question I am more interested in is precision. Matrix multiply throughput can be dramatically higher for tensor cores [1] than for the normal shader ALU, specifically for reduced-precision arithmetic. I'm wondering to what extent that's accessible on the M1, and to what extent IREE can address them.

Regarding prefix sum, the specific question I'm interested in is that SPIR-V OpControlBarrier with device scope gets translated into threadgroup_barrier(mem_device) [2]. That's insufficient to make decoupled look-back work. Conversely, if you're not using decoupled look-back, you're not getting the full throughput on GPUs that do support that barrier. I'm wondering how your infrastructure deals with that.

[1]: https://developer.nvidia.com/blog/programming-tensor-cores-c...
[2]: https://github.com/linebender/piet-gpu/blob/d81e5cb4ee145abd...
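To make the barrier concern concrete, here is a minimal CPU-side sketch of the decoupled look-back idea in Rust. It is an analogy, not GPU code and not anything from piet-gpu or IREE: one OS thread stands in for each workgroup, and Release/Acquire atomics stand in for the device-scope visibility the algorithm depends on. The function name, the flag encoding, and the partition size are all made up for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical encoding: flag in the top 2 bits, value in the low 62 bits.
const FLAG_NOT_READY: u64 = 0; // nothing published yet
const FLAG_AGGREGATE: u64 = 1; // only this partition's local sum is published
const FLAG_PREFIX: u64 = 2; // inclusive prefix through this partition is published

fn pack(flag: u64, value: u64) -> u64 {
    (flag << 62) | value
}

fn unpack(word: u64) -> (u64, u64) {
    (word >> 62, word & ((1 << 62) - 1))
}

/// Inclusive prefix sum with one thread per partition (sums must fit in 62 bits).
fn decoupled_lookback_scan(input: &[u32], part_size: usize) -> Vec<u64> {
    assert!(part_size > 0);
    let n_parts = (input.len() + part_size - 1) / part_size;
    let status: Arc<Vec<AtomicU64>> = Arc::new(
        (0..n_parts)
            .map(|_| AtomicU64::new(pack(FLAG_NOT_READY, 0)))
            .collect(),
    );
    let input: Arc<Vec<u32>> = Arc::new(input.to_vec());

    let handles: Vec<_> = (0..n_parts)
        .map(|p| {
            let status = Arc::clone(&status);
            let input = Arc::clone(&input);
            thread::spawn(move || {
                let start = p * part_size;
                let end = (start + part_size).min(input.len());
                let chunk = &input[start..end];
                let aggregate: u64 = chunk.iter().map(|&x| x as u64).sum();

                // Publish the local aggregate early so successors need not wait
                // for this partition's own look-back to finish.
                if p > 0 {
                    status[p].store(pack(FLAG_AGGREGATE, aggregate), Ordering::Release);
                }

                // Look back over predecessors until one has a full prefix.
                // The Acquire load is where cross-partition visibility matters.
                let mut exclusive = 0u64;
                let mut look = p;
                while look > 0 {
                    let (flag, value) = unpack(status[look - 1].load(Ordering::Acquire));
                    match flag {
                        FLAG_PREFIX => {
                            exclusive += value;
                            break;
                        }
                        FLAG_AGGREGATE => {
                            exclusive += value;
                            look -= 1;
                        }
                        _ => thread::yield_now(), // predecessor not ready, retry
                    }
                }
                status[p].store(pack(FLAG_PREFIX, exclusive + aggregate), Ordering::Release);

                // Local inclusive scan, offset by the exclusive prefix.
                let mut running = exclusive;
                chunk
                    .iter()
                    .map(|&x| {
                        running += x as u64;
                        running
                    })
                    .collect::<Vec<u64>>()
            })
        })
        .collect();

    // Joining in partition order reassembles the output in input order.
    handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect()
}
```

Calling `decoupled_lookback_scan(&data, 64)` returns the running totals of `data`. The point of the sketch is the spin on a neighbour's status word: if a store from one partition is not guaranteed to become visible to another, that loop cannot be relied on, which is the property a device-scope barrier has to provide on the GPU.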
I'll leave the prefix sum for other devs who know more :D
https://github.com/NVIDIA/cutlass/blob/master/examples/27_am...
//part of nod.ai/shark team