by menaerus
483 days ago
Pretty significant improvements. However, my back-of-the-napkin math suggests that MLA, FlashAttention, and similar optimizations provide benefits only when memory access time dominates compute in the attention implementation. Would that be the prefill phase (TTFT) and training (when batch_size >> 1), but not the decode phase (inference)?
|
Training and prefill are compute-bound. Decode is memory-bound. FlashAttention massively increases the arithmetic intensity relative to naive MHA, such that you can remain compute-bound at lower batch sizes during decode.
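
For anyone who wants to redo the napkin math, here is a rough sketch of arithmetic intensity (FLOPs per byte of memory traffic) for a single attention head. It is a simplified model, not a measurement: softmax FLOPs are ignored, fp16 (2 bytes per element) is assumed, the "fused" case assumes Q, K, V, and O each cross HBM exactly once, and the "naive" case assumes the score matrix makes four extra trips through memory. The function names and the 4096/128 shapes are made up for illustration.

```python
def attention_flops(q_len, kv_len, head_dim):
    # QK^T and PV each cost about 2 * q_len * kv_len * head_dim FLOPs.
    # Softmax is ignored (assumption: it is small next to the matmuls).
    return 4 * q_len * kv_len * head_dim


def attention_bytes(q_len, kv_len, head_dim, bytes_per_elem=2, fused=True):
    # Read Q, write O (q_len * head_dim each); read K and V (kv_len * head_dim each).
    elems = 2 * q_len * head_dim + 2 * kv_len * head_dim
    if not fused:
        # Assumed naive traffic: write scores, read them for softmax,
        # write probabilities, read them again for the PV matmul.
        elems += 4 * q_len * kv_len
    return elems * bytes_per_elem


def arithmetic_intensity(q_len, kv_len, head_dim, bytes_per_elem=2, fused=True):
    flops = attention_flops(q_len, kv_len, head_dim)
    traffic = attention_bytes(q_len, kv_len, head_dim, bytes_per_elem, fused)
    return flops / traffic


if __name__ == "__main__":
    # Hypothetical shapes: 4096-token context, head dimension 128.
    cases = [
        ("prefill, naive", 4096, 4096, False),
        ("prefill, fused", 4096, 4096, True),
        ("decode,  naive", 1, 4096, False),
        ("decode,  fused", 1, 4096, True),
    ]
    for name, q_len, kv_len, fused in cases:
        ai = arithmetic_intensity(q_len, kv_len, 128, fused=fused)
        print(f"{name}: {ai:.2f} FLOPs/byte")
```

Compare the printed numbers against your accelerator's ridge point (peak FLOP/s divided by memory bandwidth) to see which side of the roofline each case lands on. This only models the attention kernel for one sequence; the weight matmuls, where batching changes the picture, are not included.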