| HN Mirror



	by FL33TW00D 483 days ago
	You have it backwards. Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.


menaerus 483 days ago

> Decode is memory bound.

> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.

So, which one is it then?


FL33TW00D 483 days ago

It depends on the batch size and the accelerator you're running on! Decode is *typically* memory bound unless you can hit high batch sizes (in the hundreds), which is hard during serving because large batches work against low TTFT (time to first token).

https://jax-ml.github.io/scaling-book/inference/ - good read!
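
To make the batch-size dependence concrete, here is a rough roofline sketch for a single decode-time weight matmul. It is only an illustration under stated assumptions: the accelerator numbers (`PEAK_FLOPS`, `MEM_BW`) and the layer shape (`D_IN`, `D_OUT`) are made-up round figures, not the specs of any particular chip or model, and the helper names are hypothetical. It also ignores KV cache traffic and attention entirely.

```python
# Roofline sketch for one decode step through a single weight matrix.
# All hardware numbers below are ASSUMED round figures, not real specs.

def matmul_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for a [batch, d_in] x [d_in, d_out] matmul."""
    flops = 2 * batch * d_in * d_out
    # Weights are read once; activations are read and written once per token.
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved

def is_compute_bound(batch, d_in, d_out, peak_flops, mem_bw, bytes_per_elem=2):
    """Compute bound when intensity meets the accelerator's FLOPs/byte ratio."""
    critical = peak_flops / mem_bw
    return matmul_intensity(batch, d_in, d_out, bytes_per_elem) >= critical

PEAK_FLOPS = 1.0e15   # assumed: 1 PFLOP/s of bf16 compute
MEM_BW = 3.0e12       # assumed: 3 TB/s of memory bandwidth
D_IN, D_OUT = 8192, 8192  # assumed layer shape

for batch in (1, 8, 64, 256, 512, 1024):
    intensity = matmul_intensity(batch, D_IN, D_OUT)
    regime = "compute" if is_compute_bound(batch, D_IN, D_OUT, PEAK_FLOPS, MEM_BW) else "memory"
    print(f"batch={batch:5d}  intensity={intensity:7.1f} FLOPs/byte  -> {regime} bound")
```

With these assumed numbers the crossover needs roughly 333 FLOPs per byte, and since intensity is close to the batch size when the batch is small relative to the layer width, the switch from memory bound to compute bound lands at a batch size in the hundreds.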
