by johndough
2598 days ago
The answer is: No one really knows because cuBLAS is closed source. But to get within the same order of magnitude, tiling the workload for better cache utilization is usually the most important step. This article [1] explains it quite well and also lists a few other tricks.

In addition, there's also the fast Fourier transform for large filter kernels and Winograd convolutions [2] for small filter kernels.

[1] https://cnugteren.github.io/tutorial/pages/page1.html
[2] https://arxiv.org/pdf/1509.09308.pdf
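To make the tiling idea concrete, here is a rough sketch in plain Python. It is not how cuBLAS or the linked tutorial implements it; the function names and the `tile` parameter are made up for illustration. It only shows the loop structure: the output is computed block by block, so each block of the inputs is reused many times while it is still hot in cache. On a GPU the same blocks would typically be staged in shared memory.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A @ B for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=32):
    """Same result as matmul_naive, but iterating over tile x tile blocks.

    `tile` is a hypothetical tuning knob; a good value depends on the
    cache (or shared memory) size of the target hardware.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops pick one block of C and one matching pair of blocks of A and B.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops work only inside the current blocks.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

In pure Python this will not be faster, since interpreter overhead dominates. The point is only the access pattern, which is what pays off in C or CUDA.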
IIRC his kernels shipped in cuBLAS at some point.