by bmh 1786 days ago
A toy illustrative example, summing two arrays:

  CUDA
    c[i] = a[i] + b[i]
    i += 1

  Triton
    c[i:i+16] = a[i:i+16] + b[i:i+16]
    i += 16
The 16 in this example is the "block size", and could be anything. But this notion of expressing computation over blocks of dense data seems to be the big difference from other approaches.
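To make the contrast concrete, here is a minimal plain-Python sketch of the two loop shapes. It is not Triton or CUDA code and says nothing about performance; the function names `add_scalar` and `add_blocked` are made up for illustration, and the clamping of the final partial block is an assumption about how a non-multiple-of-16 length would be handled.

```python
def add_scalar(a, b):
    """Element-at-a-time view: each step handles a single index."""
    c = [0.0] * len(a)
    i = 0
    while i < len(a):
        c[i] = a[i] + b[i]
        i += 1
    return c


def add_blocked(a, b, block_size=16):
    """Block-at-a-time view: each step handles a whole slice of elements."""
    n = len(a)
    c = [0.0] * n
    i = 0
    while i < n:
        # Clamp the last block so a length that is not a multiple of
        # block_size is still handled correctly (assumed behaviour).
        end = min(i + block_size, n)
        c[i:end] = [x + y for x, y in zip(a[i:end], b[i:end])]
        i += block_size
    return c
```

Both functions return the same result for any equal-length inputs; only the unit of work per step differs.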

A very exciting consequence of the performance Triton achieves is the ability to fuse NN operations such as Matrix Multiply + LeakyReLU + Batch Norm. Previously, you needed to rely on cuBLAS for fast hand-written Matrix Multiply kernels, and then your LeakyReLU would need to read that result out of memory, and then your Batch Norm would read the LeakyReLU out of memory again.

The ability to write very fast kernels, and especially to fuse them together to avoid unnecessary memory round-trips, is a big deal!
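As a rough illustration of what fusing saves, the plain-Python sketch below compares an unfused Matrix Multiply followed by LeakyReLU with a fused version that applies the activation to each accumulator before it is written out. It only models the data flow, not GPU behaviour; the helper names and the 0.01 slope are assumptions chosen for the example, and Batch Norm is left out to keep it short.

```python
def leaky_relu(x, slope=0.01):
    """LeakyReLU on a single value; the slope is an assumed default."""
    return x if x > 0 else slope * x


def matmul(a, b):
    """Naive matrix multiply on lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def unfused(a, b):
    """Two passes: the full product is materialised, then read back."""
    tmp = matmul(a, b)  # intermediate result written out in full
    return [[leaky_relu(v) for v in row] for row in tmp]  # second pass re-reads it


def fused(a, b):
    """One pass: the activation is applied before each output is stored."""
    inner, cols = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(cols):
            acc = sum(row[k] * b[k][j] for k in range(inner))
            out_row.append(leaky_relu(acc))  # no intermediate matrix is kept
        out.append(out_row)
    return out
```

The two functions compute the same values; the fused one simply never stores the pre-activation matrix, which is the memory round-trip the post is describing.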

1 comment

> Previously, you needed to rely on cuBLAS for fast hand-written Matrix Multiply kernels, and then your LeakyReLU would need to read that result out of memory,

You could do that, but you can also tell cuBLAS to fuse ReLU by passing the "CUBLASLT_EPILOGUE_RELU" option (among others); see the manual: https://docs.nvidia.com/cuda/cublas/index.html#cublasLtEpilo...

This has been possible for years. It's the kind of one-line change that makes a big difference.