I shipped a dozen products with them (mostly video games), so there's nothing "wrong" that would make them unusable.
But programming them and setting up the graphics pipe (and all the passes, structured buffers, compiling, binding, weird errors, and synchronization) is a huge PITA as compared to the convenience of CUDA.
Compilers are way less mature, especially on some platforms cough.
Some GPU capabilities are not exposed.
No real composability or libraries.
No proper debugging.
These days, some game engines have done pretty well at making compute shaders easy to use (such as Bevy [1] -- disclaimer, I contribute to that engine). But telling the scientific/financial/etc. community that they need to run their code inside a game engine to get a decent experience is a hard sell. It's not a great situation compared to how easy it is on NVIDIA's stack.
I have recently published an AI-related open-source project entirely based on compute shaders https://github.com/Const-me/Cgml and I’m super happy with the workflow. Possible to implement very complicated things without compiling a single line of C++, the software is mostly in C#.
> setting up the graphics pipe
I’ve picked D3D11, as opposed to D3D12 or Vulkan. The 11 is significantly higher level, and much easier to use.
> compiling, binding
The compiler is design-time, I ship them compiled, and integrated into the IDE. I solved the bindings with a simple code generation tool, which parses HLSL and generates C#.
I understand why you've picked D3D11, but people have to understand that comes with serious limitations. There are no subgroups, which also means no cooperative matrix multiplication ("tensor cores"). For throughput in machine learning inference in particular, there's no way D3D11 can compete with either CUDA or a more modern compute shader stack, such as one based on Vulkan 1.3.
When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless.
> there's no way D3D11 can compete with either CUDA
Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (ie most ML inferencing) then the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA.
Most ML inferencing is throttled with memory, not compute. This certainly applies to both Whisper and Mistral models.
> it certainly doesn't spell the end of CUDA
No, because traditional HPC. Some people in the industry spent many man-years developing very complicated compute kernels, which are very expensive to port.
AI is another story. Not too hard to port from CUDA to compute shaders, because the GPU-running code is rather simple.
Moreover, it can help with performance just by removing abstraction layers. I think the reason why compute shaders-based Whisper outperformed CUDA-based version on the same GPU, these implementations do slightly different things. Unlike Python and Torch, compute shaders actually program GPUs as opposed to calling libraries with tons of abstractions layers inside them. This saves memory bandwidth storing and then loading temporary tensors.
This. It's crazy how primitive the GPU development process still is in the year 2023. Yeah it's gotten better, but there's still a massive gap with traditional development.
It's kinda like building Legos vs building actual Skyscrapers. The gap between compute shaders and CUDA is massive. At least it feels massive because CUDA has some key features that compute shaders lack, and which make it so much easier to build complex, powerful and fast applications.
One of the features that would get compute shaders far ahead compared to now would be pointers and pointer casting - Just let me have a byte buffer and easily cast the bytes to whatever I want. Another would be function pointers. These two are pretty much the main reason I had to stop doing a project in OpenGL/Vulkan, and start using CUDA. There are so many more, however, that make life easier like cooperative groups with device-wide sync, being able to allocate a single buffer with all the GPU memory, recursion, etc.
Khronos should start supporting C++20 for shaders (basically what CUDA is) and stop the glsl or spirv nonsense.
You might argue for forking off from glsl and SPIR-V for complex compute workloads, but lightweight, fast compilers for a simple language like glsl do solve issues for graphics. Some graphics use cases don't get around shipping a shader compiler to the user. The number of possible shader configurations is often either insanely large or just impossible to enumerate, so on the fly compilation is really the only thing you can do.
Ironically, most people use HLSL with Vulkan, because Khronos doesn't have a budget nor the people to improve GLSL.
So yet another thing where Khronos APIs are dependent on DirectX evolution.
It used to be that AMD and NVidia would first implement new stuff on DirectX in collaboration with Microsoft, have them as extensions in OpenGL, and eventually as standard features.
For GPGPU tasks, they lack a lot of useful features that CUDA has like the ability to allocate memory and launch kernels from the GPU. They also generally require you to write your GPU and CPU portions of an algorithm in different languages, while CUDA allows you to intermix your code and share data structures and simple functions between the two.
There are existing efforts to compile SYCL to Vulkan compute shaders. Plenty of "weird quirks" involved since they're based on different underlying varieties of SPIR-V ("kernels" vs. "shaders") and seem to have evolved independently in other ways (Vulkan does not have the amount of support for numerical computation that OpenCL/SYCL has) - but nothing too terrible or anything that couldn't be addressed by future Vulkan extensions.
Vulkan 1.3 has pointers, thanks to buffer device address[1]. It took a while to get there, and earlier pointer support was flawed. I also don't know of any major applications that use this.
Modern Vulkan is looking pretty good now. Cooperative matrix multiplication has also landed (as a widely supported extension), and I think it's fair to say it's gone past OpenCL.
Whether we get significant adoption of all this I think is too early to say, but I think it's a plausible foundation for real stuff. It's no longer just a toy.
Is IREE the main runtime doing Vulkan or are there others? Who should we be listening to (oh wise @raphlinus)?
It's been awesome seeing folks like Keras 3.0 kicking out broad Intercompatibility across JAX, TF, Pytorch, powered by flexible executuon engines. Looking forward to seeing more Vulkan based runs getting socialized benchmarked & compared. https://news.ycombinator.com/item?id=38446353
The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has, I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This is currently lagging in performance due to lack of subgroups and cooperative matrix mult, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.
How feasible would it be to target Vulkan 1.3 or such from standard SYCL (as first seen in Sylkan, for earlier Vulkan Compute)? Is it still lacking the numerical properties for some math functions that OpenCL and SYCL seem to expect?
"Using a pointer in a shader - In Vulkan GLSL, there is the GL_EXT_buffer_reference extension "
That extension is utter garbage. I tried it. It was the last thing I tried before giving up on GLSL/Vulkan and switching to CUDA. It was the nail in the coffin that made me go "okay, if that's the best Vulkan can do, then I need to switch to CUDA". It's incredibly cumbersome, confusing and verbose.
What's needed are regular, simple, C-like pointers.
Compute shaders are not capable of using modern GPU features like tensor cores or many of the other features needed to feed tensor cores data fast enough (e.g. TMA/cp.async.shared)