by volta83 1786 days ago

> Previously, you needed to rely on cuBLAS for fast hand-written Matrix Multiply kernels, and then your LeakyReLU would need to read that result out of memory,

You could do that, but you can also tell cuBLAS to fuse the ReLU into the matrix multiply by passing the "CUBLASLT_EPILOGUE_RELU" epilogue option (one of several) through the cuBLASLt API. See the manual: https://docs.nvidia.com/cuda/cublas/index.html#cublasLtEpilo...

This has been possible for years. It's the kind of one-line change that makes a big difference.
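To make "fuse" concrete, here is a minimal pure-Python sketch of what an epilogue means. It does not call cuBLAS at all; the function names (`matmul_then_relu`, `matmul_fused_relu`) are made up for illustration. In cuBLASLt the epilogue is selected on the matmul descriptor (the `CUBLASLT_MATMUL_DESC_EPILOGUE` attribute), but the idea is the same: apply the activation to each accumulator before it is stored, instead of writing the product out and reading it back in a second pass.

```python
def relu(x):
    # Plain ReLU: negative values become zero.
    return x if x > 0.0 else 0.0


def matmul_then_relu(a, b):
    """Unfused: materialize the full product, then re-read it to apply ReLU."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
         for i in range(n)]
    # Second pass over the stored result (the extra memory traffic).
    return [[relu(v) for v in row] for row in c]


def matmul_fused_relu(a, b):
    """Fused: apply ReLU to each accumulator right before it is stored."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = relu(acc)  # the "epilogue": no second pass needed
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [2.0, -1.0]]
fused = matmul_fused_relu(a, b)
unfused = matmul_then_relu(a, b)
```

Both functions return the same matrix; the only difference is that the fused version never stores the pre-activation product. On a GPU that intermediate store and reload is the cost the epilogue option avoids.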