| > no subgroups Indeed, in D3D they are called “wave intrinsics” and require D3D12. But that’s IMO a reasonable price to pay for hardware compatibility. > no cooperative matrix multiplication Matrix multiplication compute shader which uses group shared memory for cooperative loads: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral... > tensor cores When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless. > there's no way D3D11 can compete with either CUDA My D3D11 port of Whisper outperformed original CUDA-based implementation running on the same GPU: https://github.com/Const-me/Whisper/ |