Hacker News
by rishabhaiover 181 days ago
I ran an experiment on FlashAttention in Triton to measure the impact of caching tiles in shared memory. Surprisingly, performance had a non-monotonic relationship with prefetching these tiles, and the effect was kernel-dependent: the attention kernel benefits from prefetching tiles, while MLP W1 doesn't.
1 comments

Very interesting, and I would love to see the experiments. Quick question: what do you mean by kernel-dependent?
Sorry for not being clear. We had two different CUDA functions, one was for Attention and one was for the MLP. Here's the kernel code: https://github.com/sankirthk/GPT2-Kernel-Fusion/blob/main/ke...

We saw different results from pipelining with the attention kernel vs. the MLP kernel. Since MLP W1 has to project the attention results into a much higher dimension, its arithmetic intensity shifts towards compute-bound characteristics.
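To make the arithmetic-intensity point concrete, here is a rough roofline-style sketch in Python. Everything in it is an assumption for illustration rather than a measurement from the experiment: the GPT-2-small-like shapes (768 -> 3072, 1024 tokens), the fp16 element size, the hypothetical `PEAK_FLOPS` and `MEM_BW` numbers, and the `Gemm` and `ideal_overlap_speedup` helpers. The small-K GEMM is only a lower-intensity contrast, not a model of the FlashAttention kernel.

```python
from dataclasses import dataclass

# Hypothetical hardware numbers, chosen only to make the arithmetic easy.
PEAK_FLOPS = 100e12  # 100 TFLOP/s
MEM_BW = 1e12        # 1 TB/s


@dataclass
class Gemm:
    m: int  # rows of the activation matrix (tokens)
    k: int  # input feature dimension
    n: int  # output feature dimension
    bytes_per_elem: int = 2  # assuming fp16/bf16

    @property
    def flops(self) -> int:
        # One multiply and one add per (m, n, k) triple.
        return 2 * self.m * self.k * self.n

    @property
    def bytes_moved(self) -> int:
        # Read A (m x k) and B (k x n) once, write C (m x n) once.
        elems = self.m * self.k + self.k * self.n + self.m * self.n
        return self.bytes_per_elem * elems

    @property
    def arithmetic_intensity(self) -> float:
        return self.flops / self.bytes_moved


def ideal_overlap_speedup(g: Gemm, peak_flops: float, mem_bw: float) -> float:
    """Upper bound on the gain from perfectly overlapping loads with compute.

    Without overlap the time is t_compute + t_memory; with perfect overlap
    it is max(t_compute, t_memory). The ratio is always between 1 and 2.
    """
    t_compute = g.flops / peak_flops
    t_memory = g.bytes_moved / mem_bw
    return (t_compute + t_memory) / max(t_compute, t_memory)


# Assumed MLP W1 shape: 1024 tokens projected from 768 to 3072 features.
mlp_w1 = Gemm(m=1024, k=768, n=3072)
# Hypothetical small-K GEMM (K=64), used purely as a contrast.
small_k = Gemm(m=1024, k=64, n=1024)

for name, g in [("mlp_w1", mlp_w1), ("small_k", small_k)]:
    print(name,
          f"intensity={g.arithmetic_intensity:.1f} FLOP/byte",
          f"ideal_overlap_speedup={ideal_overlap_speedup(g, PEAK_FLOPS, MEM_BW):.2f}x")
```

Under these assumed numbers the W1-shaped GEMM sits well above the ridge point of 100 FLOP/byte (`PEAK_FLOPS / MEM_BW`), so hiding its loads behind compute can only help a little, while the lower-intensity GEMM has more to gain. The model ignores shared-memory pressure and occupancy, so it cannot reproduce the non-monotonic behaviour itself.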

Agreed, this observation holds true for both decode and prefill. Thanks for sharing the code.