by rldjbpin
264 days ago
> please be warned that this really is research code; it is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way

the writeup is a classic example of what we lose through abstraction, and of how custom, optimized code still beats sticking to high-level implementations. i would go further and say that the "megakernel" written as part of the optimization is highly model-dependent as well.

the whole "cuda moat" comes from the generic implementations of the moving parts of the model architecture. at the same time, you lose a lot of performance through that generic code. it is like comparing a stock trading algo written in next.js with one written in assembly.

training models is another landscape altogether, so props to those who can quickly adapt to the hardware they've got.