Hacker News new | ask | show | jobs
by amindiro 182 days ago
Very interesting and Would love to see the experiments. Quick question: what do you mean about kernel dependent ?
1 comments

Sorry for not being clear. We had two different CUDA functions, one was for Attention and one was for the MLP. Here's the kernel code: https://github.com/sankirthk/GPT2-Kernel-Fusion/blob/main/ke...

We saw different results of pipelining with the Attention kernel vs the MLP kernel (since MLP W1 has to project the attention results into a much higher dimension, the arithmetic intensity shifts towards compute bound characteristics)

Agreed, this observation holds true for both decode and prefill. Thanks for sharing the code