by david-gpu
825 days ago
I used to work at NVIDIA on the design of their tensor cores. As you can imagine, I had to be rather familiar with the various kinds of high-performance kernels that people are talking about in this thread. I second the GP: nobody in their right mind would try to compete with the performance or functionality of libraries like cuDNN or cuBLAS. NVIDIA pays an army of exceptionally skilled folks to write these high-performance kernels, working hand in hand with the architects who design the hardware, and with access to sophisticated tools and performance models beyond what is available to the general public. It would be like trying to compete against Olympians, to use an analogy that we can all understand.
Also, until all of these libraries are made amenable to kernel fusion, or at least gain prologue/epilogue features, they can be beaten on memory-bandwidth-bound work by fairly unoptimized fused kernels that keep intermediates out of global memory.
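To make the fusion point concrete, here is a minimal sketch, not anything taken from cuBLAS or cuDNN. The kernel names, the scale/bias/ReLU operation, and the `run_scale_bias_relu` host helper are all made up for illustration, and error checking is left out for brevity. The unfused path writes an intermediate array to global memory and reads it back, while the fused kernel keeps the intermediate in a register:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Unfused step 1: tmp = a * x. One global read and one global write per element.
__global__ void scale(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}

// Unfused step 2: y = max(tmp + b, 0). Another global read and write per element.
__global__ void bias_relu(const float* tmp, float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i] + b, 0.0f);
}

// Fused: same math, but the intermediate never leaves a register,
// so global traffic drops from 2 reads + 2 writes to 1 read + 1 write.
__global__ void scale_bias_relu(const float* x, float* y, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(a * x[i] + b, 0.0f);
}

// Hypothetical host helper (illustration only, no error checking).
// Note: the fused kernel may be compiled to an FMA, so results can differ
// from the unfused path in the last bit for general inputs.
std::vector<float> run_scale_bias_relu(const std::vector<float>& h_x,
                                       float a, float b, bool fused) {
    int n = static_cast<int>(h_x.size());
    std::vector<float> h_y(n);
    if (n == 0) return h_y;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *d_x = nullptr, *d_tmp = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_tmp, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (fused) {
        scale_bias_relu<<<blocks, threads>>>(d_x, d_y, a, b, n);
    } else {
        scale<<<blocks, threads>>>(d_x, d_tmp, a, n);
        bias_relu<<<blocks, threads>>>(d_tmp, d_y, b, n);
    }

    // cudaMemcpy on the default stream waits for the kernels above to finish.
    cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_tmp);
    cudaFree(d_y);
    return h_y;
}
```

For an elementwise chain like this, both paths are limited by memory bandwidth rather than arithmetic, so halving the global traffic is where the fused version would be expected to win, even with no tuning at all.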
I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and that NVIDIA is getting wiser on the kernel-fusion track. They're amazingly fast and game-changing, but they still don't cover a big chunk of the original libraries.
Also, a lot of problems that are amenable to GPU compute can't be expressed in terms of BLAS/DNN primitives, yet can be expressed very simply as CUDA code and still deliver huge performance gains over CPUs, with little chance that the Olympians will ever take an interest in your problem space.
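As an illustration of that kind of problem, here is a minimal sketch of a per-point escape-time (Mandelbrot-style) iteration. It has no natural BLAS or DNN formulation, but it maps onto one thread per point in a few lines. The problem choice, the kernel name, and the `escape_times` host helper are my own assumptions for the example, and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per point c = (re, im): iterate z = z^2 + c from z = 0 and
// count the steps taken while |z|^2 <= 4, capped at max_iter.
__global__ void escape_time(const float* re, const float* im, int* iters,
                            int max_iter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float cr = re[i], ci = im[i];
    float zr = 0.0f, zi = 0.0f;
    int k = 0;
    while (k < max_iter && zr * zr + zi * zi <= 4.0f) {
        float t = zr * zr - zi * zi + cr;  // real part of z^2 + c
        zi = 2.0f * zr * zi + ci;          // imaginary part of z^2 + c
        zr = t;
        ++k;
    }
    iters[i] = k;
}

// Hypothetical host helper (illustration only, no error checking).
// re and im are assumed to have the same length.
std::vector<int> escape_times(const std::vector<float>& re,
                              const std::vector<float>& im, int max_iter) {
    int n = static_cast<int>(re.size());
    std::vector<int> out(n);
    if (n == 0) return out;

    float *d_re = nullptr, *d_im = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_re, n * sizeof(float));
    cudaMalloc(&d_im, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_re, re.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_im, im.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    escape_time<<<blocks, threads>>>(d_re, d_im, d_out, max_iter, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_re);
    cudaFree(d_im);
    cudaFree(d_out);
    return out;
}
```

Nothing here is tuned (no shared memory, no attention to divergence between neighboring threads), which is the point: the straightforward version is already a plausible starting point for this class of problem.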