|
|
|
|
|
by raphlinus
917 days ago
|
|
Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (ie most ML inferencing) then the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA. |
|
Most ML inferencing is throttled with memory, not compute. This certainly applies to both Whisper and Mistral models.
> it certainly doesn't spell the end of CUDA
No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.
AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.
Moreover, it can help with performance just by removing abstraction layers. I think the reason why compute shaders-based Whisper outperformed CUDA-based version on the same GPU, these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program GPUs as opposed to calling libraries with tons of abstractions layers inside them. This saves memory bandwidth storing and then loading temporary tensors.