| HN Mirror

> ie most ML inferencing

Most ML inferencing is throttled with memory, not compute. This certainly applies to both Whisper and Mistral models.

> it certainly doesn't spell the end of CUDA

No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.

AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.

Moreover, it can help with performance just by removing abstraction layers. I think the reason why compute shaders-based Whisper outperformed CUDA-based version on the same GPU, these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program GPUs as opposed to calling libraries with tons of abstractions layers inside them. This saves memory bandwidth storing and then loading temporary tensors.