by chillee
2481 days ago
The biggest thing is using lower precision:

> this implementation gives 25% speedup over Nvidia's PyTorch implementation in full precision and 2.5-3x speedup when using TensorCore

Tensor Cores are lower-precision cores. Other than that, the speedups presumably come from better-written CUDA. If you're asking what the bottlenecks are in general for these kinds of kernels, they're pretty much always memory bound. See pages 3-5 for the kinds of optimizations that need to be done: https://people.csail.mit.edu/jrk/jrkthesis.pdf
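To make the "memory bound" point concrete, here's a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the peak FLOP rate, the bandwidth, and the elementwise kernel (3 values moved and 4 FLOPs per element) are made-up numbers, not measurements of the implementation being discussed. It just shows that when a kernel does few FLOPs per byte, runtime tracks bytes moved, so halving the element size roughly halves the time.

```python
import math


def arithmetic_intensity(flops_per_element: float, bytes_per_element: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops_per_element / bytes_per_element


def is_memory_bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> bool:
    """A kernel is memory bound when its intensity is below the machine balance
    (the FLOPs the hardware can execute per byte it can move)."""
    return intensity < peak_flops / peak_bandwidth


def memory_time_seconds(n_elements: int, bytes_per_element: float, peak_bandwidth: float) -> float:
    """Lower bound on runtime if memory traffic is the only cost."""
    return n_elements * bytes_per_element / peak_bandwidth


# Hypothetical hardware numbers, chosen only for illustration.
PEAK_FLOPS = 10e12        # 10 TFLOP/s
PEAK_BANDWIDTH = 500e9    # 500 GB/s

# Hypothetical elementwise kernel: reads 2 values, writes 1, does 4 FLOPs.
FLOPS_PER_ELEMENT = 4
VALUES_MOVED_PER_ELEMENT = 3
N = 1 << 24

for name, element_size in (("fp32", 4), ("fp16", 2)):
    bytes_per_element = VALUES_MOVED_PER_ELEMENT * element_size
    intensity = arithmetic_intensity(FLOPS_PER_ELEMENT, bytes_per_element)
    bound = is_memory_bound(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
    t = memory_time_seconds(N, bytes_per_element, PEAK_BANDWIDTH)
    print(f"{name}: intensity={intensity:.3f} FLOP/byte, "
          f"memory_bound={bound}, min_time={t * 1e6:.1f} us")
```

Under these assumed numbers both variants sit far below the machine balance of 20 FLOP/byte, so the compute units are mostly idle and the only lever is moving fewer bytes (lower precision, fusion, better locality).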