by zaptrem 708 days ago
> Is FlashAttention simply a drop-in replacement for the attention operation in an LLM? Can it be used anywhere that an "attention" operation is used? Or does a LLM need to be trained specially to use FA?

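Yes, it's a drop-in replacement. It computes exact attention (the same output up to floating-point rounding), just with a more memory-efficient kernel, so the model doesn't need to be trained specially to use it.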
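To make the "same output" point concrete, here's a rough pure-Python sketch comparing plain attention with a blockwise version that keeps a running max and running sum (the online-softmax trick Flash Attention is built on). The function names and the `block` parameter are made up for illustration; the real kernels also tile the queries and run on the GPU, which this doesn't attempt to show.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, building each full score row."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same math, but keys/values are visited block by block, so the full
    score row is never stored (hypothetical helper, not a real library API)."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf      # running max of the scores seen so far
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output row
        for start in range(0, len(K), block):
            ks = K[start:start + block]
            vs = V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescales what was accumulated earlier
            p = [math.exp(s - m_new) for s in scores]
            l = l * scale + sum(p)
            acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, vs))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
Q, K, V = random_matrix(3, 4), random_matrix(5, 4), random_matrix(5, 4)
# The difference should be on the order of floating-point rounding error.
print(max_abs_diff(naive_attention(Q, K, V), tiled_attention(Q, K, V)))
```

Since the two functions agree on the same weights and inputs, swapping one for the other doesn't change what the model computes.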

> How does FA relate to attention strategies like GQA (grouped query attention) or sliding-window attention? Are they orthogonal concepts? Or you need a specific FA implementation for each strategy?

Flash Attention is a way of computing the softmax(QK^T)V part of attention, whereas GQA changes how the Q, K, and V matrices are produced: several query heads share one K/V head. Sliding-window attention (less sure about this, there are a bunch of windowed attention techniques) changes the attention mask (the thing that controls which queries can attend to which keys). So they're mostly orthogonal, though a fused kernel still has to support the head layout and mask you want.
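A small sketch of the two ideas, in case it helps. The helper names are hypothetical, and the window convention here (each query sees itself plus the previous `window - 1` positions) is just one common choice; implementations differ.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """GQA: consecutive query heads are grouped and share one K/V head."""
    assert n_query_heads % n_kv_heads == 0
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def causal_mask(n):
    """mask[i][j] is True when query i may attend to key j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Causal mask restricted to the most recent `window` keys."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


print([kv_head_for(h, 8, 2) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
for row in sliding_window_mask(4, 2):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# .xx.
# ..xx
```

Neither piece touches how softmax(QK^T)V itself is evaluated: GQA decides which K/V a query head reads, and the mask decides which scores count.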

> Recently llama.cpp added flash attention support - does this just mean they started consuming a flash attention-provided CUDA kernel or something?

I don't use llama.cpp but that sounds about right.

> lastly, in this post, they compare FlashAttention to Triton. I thought Triton was like an abstraction layer? Couldn't FA be implemented in Triton? I just don't really get what it means to say "FlashAttention vs. Triton".

They're talking about a previous Flash Attention implementation written in Triton.