Huh? I thought the issue before RingAttention was the memory requirement of the softmax layer, since you have to load the whole attention matrix at once? That's O(s^2), no?
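No, FlashAttention already solved the memory requirements of attention: it computes the softmax block by block, so the full s x s score matrix is never materialized and memory grows linearly in s. RingAttention is primarily useful for parallelizing across the sequence dimension, i.e. splitting one long sequence over multiple devices.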
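To make the memory point concrete, here is a rough pure-Python sketch of the online-softmax trick for a single query vector. It is not FlashAttention's actual kernel, and the names `naive_attention`, `streaming_attention`, and `block_size` are made up for illustration. The streaming version only keeps a running max, a running normalizer, and a running weighted sum, so it never holds all s scores at once.

```python
import math

def naive_attention(q, keys, values):
    # Holds all s scores for this query; repeated for every query this is the s x s matrix.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def streaming_attention(q, keys, values, block_size=2):
    # Online softmax: process keys/values one block at a time.
    running_max = -math.inf
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new max (factor is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]
```

Both functions should agree up to floating-point error. As I understand it, RingAttention builds on the same blockwise accumulation, but with the key/value blocks living on different devices and being passed around a ring.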