
Without having read the paper, I think you have it a little backwards: the IR doesn't itself allow for more general functions. More general functions are possible (in theory) because the frontend (the Triton language) is decoupled from the backend (CUDA), with the IR as the interface between them. In this sense the Triton IR is no less domain-specific than XLA, since both are IRs that represent sequences of operators that run on a GPU (or TPU or whatever). I guess in theory Triton could be eschewing all of e.g. cuDNN, but most likely it isn't, as NVIDIA's closed-source kernels perform best on their closed-source hardware.
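To make the decoupling point concrete, here is a toy sketch in Python. Everything in it is hypothetical and made up for illustration (`Op`, `lower_axpy`, `run_interpreter`, `emit_text`); it is not Triton's or XLA's actual IR or API. The only point is that the frontend emits a small IR once, and any number of backends can consume it without the frontend knowing about them.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Hypothetical toy IR: an operand is either a named value or a literal.
Operand = Union[str, float]


@dataclass(frozen=True)
class Op:
    kind: str      # "mul" or "add"
    lhs: Operand
    rhs: Operand
    dst: str       # name of the value this op defines


def lower_axpy(a: float) -> List[Op]:
    """Toy 'frontend': lower out = a * x + y into the IR."""
    return [Op("mul", a, "x", "t0"), Op("add", "t0", "y", "out")]


def run_interpreter(ir: List[Op], inputs: Dict[str, float]) -> float:
    """Toy backend #1: execute the IR directly on Python floats."""
    env = dict(inputs)

    def value(operand: Operand) -> float:
        # Names are looked up in the environment; literals are used as-is.
        return env[operand] if isinstance(operand, str) else operand

    for op in ir:
        lhs, rhs = value(op.lhs), value(op.rhs)
        env[op.dst] = lhs * rhs if op.kind == "mul" else lhs + rhs
    return env["out"]


def emit_text(ir: List[Op]) -> str:
    """Toy backend #2: print the same IR as pseudo-assembly text."""
    return "\n".join(f"{op.kind} {op.dst}, {op.lhs}, {op.rhs}" for op in ir)


ir = lower_axpy(2.0)
print(run_interpreter(ir, {"x": 3.0, "y": 1.0}))  # 2.0 * 3.0 + 1.0 = 7.0
print(emit_text(ir))
```

Swapping `emit_text` for something that targets a different device would not require touching `lower_axpy`, which is the sense in which generality comes from the frontend/backend split rather than from the IR itself.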

Edit: I should've read the post before commenting. It looks like they are in fact using LLVM's PTX backend (i.e. generating CUDA kernels from scratch). Kudos to them.