by FL33TW00D 483 days ago
You have it backwards.

Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.


> Decode is memory bound.

> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.

So, which one is it then?

It depends on the batch size and the accelerator you're running on! Decode is *typically* memory bound unless you can hit high batch sizes (in the hundreds), which is hard during serving because larger batches trade off against a low time to first token (TTFT).
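To make the "in the hundreds" figure concrete, here is a rough roofline sketch. It only models the weight matmuls: each parameter costs 2 FLOPs per token and is read from memory once per forward pass, so activation and KV-cache traffic are ignored (that is an assumption, and it flatters decode at long contexts). The accelerator numbers and all function names are made up for illustration, not taken from any real chip or library.

```python
def matmul_intensity(tokens: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each parameter does 2 FLOPs (multiply + add) per token and is read
    once per pass. Activation and KV-cache traffic are ignored.
    """
    return 2.0 * tokens / bytes_per_param


def critical_tokens(peak_flops: float, mem_bandwidth: float,
                    bytes_per_param: float = 2.0) -> float:
    """Tokens per pass at which the matmuls sit exactly on the roofline ridge."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte the chip can sustain
    return ridge * bytes_per_param / 2.0


def bound(tokens: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify one forward pass as 'compute' or 'memory' bound."""
    ridge = peak_flops / mem_bandwidth
    if matmul_intensity(tokens, bytes_per_param) >= ridge:
        return "compute"
    return "memory"


# Hypothetical accelerator (assumed numbers, not a real part):
PEAK_FLOPS = 2.0e14      # 200 TFLOP/s of bf16 matmul throughput
MEM_BANDWIDTH = 8.0e11   # 800 GB/s of HBM bandwidth

print("critical tokens per pass:", critical_tokens(PEAK_FLOPS, MEM_BANDWIDTH))
for tokens in (1, 32, 256, 2048):
    # Decode: tokens per pass == batch size (one new token per sequence).
    # Prefill/training: tokens per pass == batch size * sequence length.
    print(tokens, bound(tokens, PEAK_FLOPS, MEM_BANDWIDTH))
```

With bf16 weights the intensity is just the number of tokens per pass, which is why prefill (thousands of tokens per pass) clears the ridge easily while decode needs a batch roughly as large as the chip's FLOPs-to-bandwidth ratio.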

https://jax-ml.github.io/scaling-book/inference/ - good read!