Hacker News
by karmasimida 1100 days ago
I think they are orthogonal.

Flash attention is just another way to compute exact attention.
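To make "exact" concrete, here is a rough sketch of the tiling idea for a single query in plain Python. This is not the actual FlashAttention kernel, only the running-softmax trick that lets you visit keys and values one tile at a time without materializing the full score row. The function names and the `tile` parameter are made up for illustration.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax(q.k / sqrt(d)) weighted sum of values, for one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Same math, but keys/values are visited one tile at a time.

    Keeps a running max, a running softmax denominator, and a running
    unnormalized output, rescaling them whenever the max changes.
    """
    d = len(q)
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (factor is 0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

Both functions should agree up to floating-point rounding, which is the sense in which the tiled version is still exact attention and not an approximation.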

This work is mainly about resolving memory fragmentation across different sequences.

You still need to compute attention as usual once you have retrieved the needed keys and values.
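A toy sketch of that separation, assuming a block-table layout in which each sequence maps to a list of fixed-size physical blocks. `PagedKVCache`, `BLOCK_SIZE`, and the method names are hypothetical and not taken from any real library. The point is only that paging changes where keys and values live, while the attention math stays the same.

```python
import math

BLOCK_SIZE = 2  # tokens per physical block (hypothetical; chosen small for the demo)

def attention(q, keys, values):
    """Plain softmax attention for one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

class PagedKVCache:
    def __init__(self, num_blocks):
        # Each physical block holds up to BLOCK_SIZE (key, value) pairs.
        self.blocks = [[] for _ in range(num_blocks)]
        self.free = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append(self, seq_id, key, value):
        table = self.block_tables.setdefault(seq_id, [])
        # Grab a new block only when the sequence has none or its last one is full.
        # Raises IndexError if the pool is exhausted (no eviction in this sketch).
        if not table or len(self.blocks[table[-1]]) == BLOCK_SIZE:
            table.append(self.free.pop())
        self.blocks[table[-1]].append((key, value))

    def gather(self, seq_id):
        """Follow the block table and return keys/values in token order."""
        keys, values = [], []
        for block_id in self.block_tables[seq_id]:
            for k, v in self.blocks[block_id]:
                keys.append(k)
                values.append(v)
        return keys, values
```

If two sequences append tokens in interleaved order, their blocks end up scattered through the pool, yet `attention(q, *cache.gather(seq_id))` should match attention over a contiguous copy of the same keys and values.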

1 comment

Thanks for the explanation! I believe the two ideas are basically orthogonal. FlashAttention reduces memory reads and writes, while PagedAttention reduces memory waste.