by johndough 2598 days ago
The answer is: No one really knows because cuBLAS is closed source.

But to get within the same order of magnitude as cuBLAS, tiling the workload for better cache utilization is usually the most important step. This article [1] explains it quite well and also lists a few other tricks.
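To make the tiling idea concrete, here is a minimal sketch of the loop structure in plain Python. It is not taken from the article or from cuBLAS; the function names and the `tile` parameter are made up for illustration. Python won't show any speedup, the point is only the blocking: on a GPU, each tile of `a` and `b` would be staged in shared memory and reused by a whole thread block.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A @ B for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same result, but computed one tile x tile block at a time.

    `tile` is a hypothetical tuning knob; a real kernel would pick it
    to match the size of the fast memory it is targeting.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # block row of C
        for j0 in range(0, m, tile):      # block column of C
            for p0 in range(0, k, tile):  # block along the shared dimension
                # Everything below touches only one small block of A and B,
                # which is what lets it stay resident in fast memory.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Calling `matmul_tiled(a, b, tile=2)` should give the same matrix as `matmul_naive(a, b)`, including when the dimensions are not multiples of the tile size.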

In addition, there's the fast Fourier transform for large filter kernels and Winograd convolution [2] for small filter kernels.
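For a feel of why Winograd helps with small filters, here is a rough sketch of the smallest 1D case, usually written F(2,3): two outputs of a 3-tap filter from four inputs, using 4 multiplications instead of 6. The function names are my own, and this is the "correlation" form used in CNNs (filter not flipped). Real implementations nest this into 2D tiles and batch the multiplications into matrix products, none of which is shown here.

```python
def conv_direct(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def conv_winograd_f23(d, g):
    """Same two outputs using 4 multiplications (Winograd F(2,3))."""
    # Filter transform: depends only on g, so it can be precomputed once
    # and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The only 4 multiplications that involve the input data.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Both functions should agree up to floating-point rounding. The saving grows in 2D, at the price of extra additions and some loss of numerical accuracy for larger tiles.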

[1] https://cnugteren.github.io/tutorial/pages/page1.html

[2] https://arxiv.org/pdf/1509.09308.pdf

1 comment

Not entirely true - Scott Gray knows: https://github.com/NervanaSystems/maxas/wiki/SGEMM

IIRC his kernels shipped in cuBLAS at some point.