Hacker News new | ask | show | jobs
by Const-me 918 days ago
I have recently published an AI-related open-source project entirely based on compute shaders https://github.com/Const-me/Cgml and I’m super happy with the workflow. Possible to implement very complicated things without compiling a single line of C++, the software is mostly in C#.

> setting up the graphics pipe

I’ve picked D3D11, as opposed to D3D12 or Vulkan. The 11 is significantly higher level, and much easier to use.

> compiling, binding

The compiler is design-time, I ship them compiled, and integrated into the IDE. I solved the bindings with a simple code generation tool, which parses HLSL and generates C#.

> No proper debugging

I partially agree but still, we have renderdoc.

1 comments

I understand why you've picked D3D11, but people have to understand that comes with serious limitations. There are no subgroups, which also means no cooperative matrix multiplication ("tensor cores"). For throughput in machine learning inference in particular, there's no way D3D11 can compete with either CUDA or a more modern compute shader stack, such as one based on Vulkan 1.3.
> no subgroups

Indeed, in D3D they are called “wave intrinsics” and require D3D12. But that’s IMO a reasonable price to pay for hardware compatibility.

> no cooperative matrix multiplication

Matrix multiplication compute shader which uses group shared memory for cooperative loads: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

> tensor cores

When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless.

> there's no way D3D11 can compete with either CUDA

My D3D11 port of Whisper outperformed original CUDA-based implementation running on the same GPU: https://github.com/Const-me/Whisper/

Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (ie most ML inferencing) then the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA.
> ie most ML inferencing

Most ML inferencing is throttled with memory, not compute. This certainly applies to both Whisper and Mistral models.

> it certainly doesn't spell the end of CUDA

No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.

AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.

Moreover, it can help with performance just by removing abstraction layers. I think the reason why compute shaders-based Whisper outperformed CUDA-based version on the same GPU, these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program GPUs as opposed to calling libraries with tons of abstractions layers inside them. This saves memory bandwidth storing and then loading temporary tensors.