by casercaramel144 843 days ago
Huh? I thought the issue before RingAttention was the memory requirement of the softmax layer, since you have to load the whole matrix in at once? It's O(s^2), no?

Also hi horace.

1 comments

Who is this :think:

But no, FlashAttention already solved the memory requirements of attention. RingAttention is primarily useful for parallelizing across the sequence dimension.
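
For anyone following along, here is a rough sketch of the idea behind that memory claim: the softmax can be computed in a streaming fashion over blocks of keys and values, so the full row of scores (and hence the s x s matrix) never has to exist at once. This is not FlashAttention's actual kernel, just a plain-Python illustration for a single query vector. The function names, the `block_size` parameter, and the omission of the usual 1/sqrt(d) scaling are all assumptions made for the example.

```python
import math


def naive_attention(q, K, V):
    """Reference version: holds all scores for this query in memory at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(d)]


def blockwise_attention(q, K, V, block_size=2):
    """Streams over K/V in blocks, keeping only a running max, a running
    softmax denominator, and an unnormalized output accumulator."""
    d = len(V[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * d

    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]

        # Scores for this block only: O(block_size) working memory.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]

        # Rescale what we have so far to the new running max.
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * scale + sum(probs)
        acc = [
            a * scale + sum(p * v[j] for p, v in zip(probs, v_blk))
            for j, a in enumerate(acc)
        ]
        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point error. The blockwise one only ever holds `block_size` scores at a time, and each K/V block is read once per query (or once per block of queries, in a batched version).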

It's camel.

How do you do matrix-vector attention without keeping the full matrix in cache? Surely you don't just load and unload it a million times.