by rldjbpin 264 days ago
> please be warned that this really is research code; it is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way

the writeup is a classic example of what we lose through abstraction and how writing custom (and optimized) code still beats sticking to high-level implementations.

i would go further and say that the "megakernel" written as part of the optimization is highly model-dependent as well.

the whole "cuda moat" comes from the generic implementations of the moving parts of a model architecture. at the same time, that generic code costs you a lot of performance. it is like comparing a stock trading algo written in next.js with one written in assembly.

training models is another landscape altogether, so props to those who can quickly adapt to the hardware they got.