But no, FlashAttention already solved the memory cost of attention. RingAttention is primarily useful for parallelizing across the sequence dimension.
How do you do matrix-vector attention without keeping the full matrix in cache? Surely you don't just load and unload it a million times.
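For what it's worth, here is a rough sketch of the usual trick for one query vector: read K/V one block at a time, keep a running max and a running softmax denominator, and rescale the accumulator whenever the max changes. Each block is loaded once, and the full score vector is never held in memory. This is a pure-Python illustration, not FlashAttention's actual kernel; `streaming_attention`, `reference_attention`, and `block_size` are names made up for this example.

```python
import math


def streaming_attention(q, keys, values, block_size=2):
    """Attention output for one query, visiting K/V one block at a time.

    Hypothetical helper for illustration; assumes keys is non-empty.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator relative to running_max
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        # Only this block of keys and values is "in cache" at this point.
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]

        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]


def reference_attention(q, keys, values):
    """Naive version that holds every score at once, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]
```

Both functions should agree up to floating-point error for any `block_size`, so the block size only changes how much of K/V is resident at once, not the result.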