by FL33TW00D
483 days ago
You have it backwards. Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.
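For a rough feel of the prefill vs. decode split, here is a back-of-the-envelope roofline sketch. It is only an illustration under stated assumptions: it models a single fp16 `d_model x d_model` projection (not attention itself), and the `PEAK_FLOPS` and `MEM_BW` figures describe a hypothetical accelerator, not any specific GPU. The helper names are made up for this example.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (n_tokens x d_model) @ (d_model x d_model) matmul."""
    # One multiply and one add per weight element, per token.
    flops = 2 * n_tokens * d_model * d_model
    # Read the weights once, read the input activations, write the outputs.
    bytes_moved = bytes_per_elem * (d_model * d_model + 2 * n_tokens * d_model)
    return flops / bytes_moved


# Hypothetical hardware numbers, chosen only for illustration.
PEAK_FLOPS = 300e12  # FLOP/s
MEM_BW = 2e12        # bytes/s
RIDGE = PEAK_FLOPS / MEM_BW  # intensity above which the matmul is compute bound


def regime(n_tokens: int, d_model: int = 4096) -> str:
    """Classify the matmul as compute-bound or memory-bound on the assumed hardware."""
    if arithmetic_intensity(n_tokens, d_model) >= RIDGE:
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    # Prefill processes a whole prompt at once; decode processes one new
    # token per sequence, so n_tokens is effectively the batch size.
    for label, n in [("prefill, 2048 tokens", 2048), ("decode, batch 1", 1)]:
        print(f"{label}: {arithmetic_intensity(n, 4096):.1f} FLOP/byte -> {regime(n)}")
```

With one token per step the weights are read once per token, so intensity sits near 1 FLOP/byte. Batching more sequences per decode step is what pushes it toward the ridge point.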
> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.
So, which one is it then?