Hacker News new | ask | show | jobs
by rishabhaiover 180 days ago
Sorry for not being clear. We had two different CUDA functions, one was for Attention and one was for the MLP. Here's the kernel code: https://github.com/sankirthk/GPT2-Kernel-Fusion/blob/main/ke...

We saw different results of pipelining with the Attention kernel vs the MLP kernel (since MLP W1 has to project the attention results into a much higher dimension, the arithmetic intensity shifts towards compute bound characteristics)

1 comments

Agreed, this observation holds true for both decode and prefill. Thanks for sharing the code