| HN Mirror



	by rishabhaiover 181 days ago
	I ran an experiment on FlashAttention in Triton to measure the impact of caching tiles in shared memory. Surprisingly, performance had a non-monotonic relationship with prefetching these tiles, and the effect was kernel-dependent: the attention kernel benefits from prefetching tiles, while the MLP W1 kernel doesn't.

1 comments

amindiro 180 days ago

Very interesting, and I would love to see the experiments. Quick question: what do you mean by kernel-dependent?


rishabhaiover 180 days ago

Sorry for not being clear. We had two different CUDA functions, one was for Attention and one was for the MLP. Here's the kernel code: https://github.com/sankirthk/GPT2-Kernel-Fusion/blob/main/ke...

We saw different pipelining results for the attention kernel and the MLP kernel. Since MLP W1 has to project the attention results into a much higher dimension, its arithmetic intensity shifts towards compute-bound characteristics.
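As a rough illustration of the arithmetic-intensity point, here is a minimal sketch that estimates FLOPs per byte for a matrix multiply. It is not taken from the linked kernel code. The shapes (sequence length 1024, hidden size 768, MLP width 3072) are assumed GPT-2-small-style values, the helper names are hypothetical, and the model assumes fp16 data with each matrix read or written from global memory exactly once.

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Idealized model: A and B are each read once and C is written once.
    """
    flops = 2 * m * k * n  # one multiply and one add per (m, k, n) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_compute_bound(intensity, peak_flops, peak_bandwidth):
    """Roofline test: compute bound when intensity exceeds peak_flops / peak_bandwidth."""
    return intensity > peak_flops / peak_bandwidth


# Assumed shapes, not taken from the thread.
seq_len, d_model, d_ff = 1024, 768, 3072

# MLP W1 projects d_model -> d_ff (4x wider output).
w1_intensity = gemm_arithmetic_intensity(seq_len, d_model, d_ff)

# Same-width projection d_model -> d_model, for comparison.
square_intensity = gemm_arithmetic_intensity(seq_len, d_model, d_model)

print(f"MLP W1 (768 -> 3072): {w1_intensity:.1f} FLOPs/byte")
print(f"Square (768 -> 768):  {square_intensity:.1f} FLOPs/byte")
```

Under these assumptions the wider projection performs more FLOPs for every byte it moves, which is the direction of the shift described above. Whether a kernel actually ends up compute bound depends on the GPU's ratio of peak FLOP/s to memory bandwidth, which `is_compute_bound` takes as parameters rather than hard-coding any particular device.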


amindiro 180 days ago

Agreed, this observation holds true for both decode and prefill. Thanks for sharing the code.
