by touisteur
824 days ago
I must be in a niche where we're consistently crushing cuBLAS, cuSOLVER and cuDNN, and sometimes CUTLASS, with internship-level competency, mostly because our problem sizes are not in the cone of optimization of the Olympians of NVIDIA: large batches of small matrices, specific matrix forms, long kernel pipelines... Also, until all of these libraries are made amenable to kernel fusion, or at least offer prologue/epilogue features, they can be beaten on memory bandwidth by pretty lowly-optimized kernels with no intermediate global memory traffic. I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and that NVIDIA is getting wiser on the kernel-fusion track. They're amazingly fast and game-changing, but they still don't cover a big chunk of the original libraries. Also, a lot of problems that are amenable to GPU compute are not expressed in BLAS/DNN terms and can still be expressed very, very simply as CUDA code, and still extract huge performance gains against CPUs, without a chance that the Olympians will ever take an interest in your problem space.
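To make the "batch of small matrices plus fused epilogue" case concrete, here is a rough sketch, not anyone's production kernel. Everything in it is an assumption made for illustration: the 4x4 size `N`, the bias-plus-ReLU epilogue, the one-thread-per-matrix mapping, and the names `batched_small_gemm_bias_relu` and `run_fused`. The only point is that the product never goes out to global memory before the epilogue is applied.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 4;  // hypothetical small matrix size, chosen for illustration

// Deliberately naive: one thread handles one matrix of the batch and computes
// C[b] = relu(A[b] * B[b] + bias). The accumulator lives in a register, so the
// un-biased, un-clamped product is never written to global memory.
__global__ void batched_small_gemm_bias_relu(const float* A, const float* B,
                                             const float* bias, float* C,
                                             int batch)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch) return;

    const float* a  = A + b * N * N;
    const float* bm = B + b * N * N;
    float*       c  = C + b * N * N;

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += a[i * N + k] * bm[k * N + j];
            acc += bias[j];                  // fused epilogue, step 1: bias
            c[i * N + j] = fmaxf(acc, 0.0f); // fused epilogue, step 2: ReLU
        }
    }
}

// Host helper (hypothetical). Error checking is omitted to keep the sketch short.
std::vector<float> run_fused(const std::vector<float>& A,
                             const std::vector<float>& B,
                             const std::vector<float>& bias, int batch)
{
    size_t bytes = size_t(batch) * N * N * sizeof(float);
    float *dA, *dB, *dBias, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dBias, N * sizeof(float));
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dBias, bias.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks  = (batch + threads - 1) / threads;
    batched_small_gemm_bias_relu<<<blocks, threads>>>(dA, dB, dBias, dC, batch);

    std::vector<float> C(size_t(batch) * N * N);
    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dBias); cudaFree(dC);
    return C;
}
```

The unfused equivalent would be a batched GEMM call followed by a separate elementwise kernel, which writes the product to global memory and reads it straight back. A serious version would also worry about coalescing and shared memory, which this sketch ignores on purpose.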
I know you probably don't mean to say that Nvidia can't write good CUDA, but this does sort of illustrate how hard that is. I've seen similar cases (tiny matrix multiplied by enormous matrix) in which it was possible to write something faster than Nvidia's library. I'm not sure if this has been addressed since though.
> they can be beaten on memory bandwidth with pretty lowly-optimized kernels
This is partly why I believe most CUDA code probably isn't "good" - there's this enormous gulf between acceptable and good which often isn't worth crossing.