Hacker News new | ask | show | jobs
by Const-me 921 days ago
> no subgroups

Indeed, in D3D they are called “wave intrinsics” and require D3D12. But that’s IMO a reasonable price to pay for hardware compatibility.

> no cooperative matrix multiplication

Matrix multiplication compute shader which uses group shared memory for cooperative loads: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

> tensor cores

When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless.

> there's no way D3D11 can compete with either CUDA

My D3D11 port of Whisper outperformed original CUDA-based implementation running on the same GPU: https://github.com/Const-me/Whisper/

1 comments

Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (ie most ML inferencing) then the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA.
> ie most ML inferencing

Most ML inferencing is throttled with memory, not compute. This certainly applies to both Whisper and Mistral models.

> it certainly doesn't spell the end of CUDA

No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.

AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.

Moreover, it can help with performance just by removing abstraction layers. I think the reason why compute shaders-based Whisper outperformed CUDA-based version on the same GPU, these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program GPUs as opposed to calling libraries with tons of abstractions layers inside them. This saves memory bandwidth storing and then loading temporary tensors.