| HN Mirror

Agree on shorter, disagree on simpler. The hard part of understanding GPU code is knowing the reasons why algorithms are the way they are. For example, why we do a split-k decomposition when doing a matrix multiplication, or why are we loading this particular data into shared memory at this particular time, with some overlapping subset into registers.

Getting rid of the for loop over an array index doesn't make it easier to understand the hard parts. Losing the developer perf and debug tooling is absolutely not worth the tradeoff.

For me I'd rather deal with Jax or Numba, and if that still wasn't enough, I would jump straight to CUDA.

It's possible I'm an old fogey with bias, though. It's true that I've spent a lot more time with CUDA than with the new DSLs on the block.