by thomasahle 1100 days ago
I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other "memory aware" Attention project I'm aware of.

I guess Flash Attention is more about using GPU SRAM efficiently, whereas this is more about using OS/CPU memory better?

2 comments

I think they are orthogonal.

Flash attention is just another way to compute exact attention.
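To make "exact" concrete, here is a rough sketch of the block-wise softmax trick for a single query vector, in plain Python. The function names and the tiny block size are made up for illustration, and this is not the real FlashAttention kernel (which also tiles the queries and keeps the working set in GPU SRAM). It only shows that visiting keys/values one block at a time, with a running max and running sum, gives the same numbers as the standard computation.

```python
import math

def naive_attention(q, keys, values):
    """Standard softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time.

    Only a running max, a running sum, and an unnormalized output
    accumulator are kept between blocks.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[i] for w, v in zip(weights, v_blk))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The two functions should agree up to floating-point rounding for any block size.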

This work mainly concerns how to resolve memory fragmentation across different sequences.

You still need to compute attention as usual once you have retrieved the needed keys and values.
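A toy sketch of that split, assuming a per-sequence block table that maps logical positions to physical blocks. `PagedKVCache`, `BLOCK_SIZE`, and the method names are hypothetical and not the project's actual data structures. The cache hands out any free block on demand, and attention itself is unchanged once the keys and values are gathered back in logical order.

```python
import math

BLOCK_SIZE = 4  # tokens per physical block (hypothetical value)

class PagedKVCache:
    """Toy key/value cache that stores tokens in fixed-size physical blocks."""

    def __init__(self, num_blocks):
        self.k_blocks = [[] for _ in range(num_blocks)]
        self.v_blocks = [[] for _ in range(num_blocks)]
        self.free = list(range(num_blocks))   # ids of unused physical blocks
        self.block_tables = {}                # seq_id -> list of physical block ids

    def append(self, seq_id, k, v):
        table = self.block_tables.setdefault(seq_id, [])
        # Grab a new block only when the sequence's last block is full.
        if not table or len(self.k_blocks[table[-1]]) == BLOCK_SIZE:
            table.append(self.free.pop())     # raises IndexError if out of blocks
        self.k_blocks[table[-1]].append(k)
        self.v_blocks[table[-1]].append(v)

    def gather(self, seq_id):
        """Return the sequence's keys and values in logical order."""
        keys, values = [], []
        for block_id in self.block_tables[seq_id]:
            keys.extend(self.k_blocks[block_id])
            values.extend(self.v_blocks[block_id])
        return keys, values

def attention(q, keys, values):
    """Plain softmax attention for one query; unaware of paging."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

# Two sequences grow in interleaved fashion, so their blocks end up scattered.
cache = PagedKVCache(num_blocks=8)
for t in range(5):
    cache.append("a", [float(t), 1.0], [float(t), 0.0])
    cache.append("b", [1.0, float(t)], [0.0, float(t)])
out = attention([0.5, 0.5], *cache.gather("a"))
```

A real kernel would read the blocks in place through the block table rather than copying them into one list; the copy here is only to keep the sketch short.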

Thanks for the explanation! I believe the two ideas are basically orthogonal. FlashAttention reduces memory reads/writes, while PagedAttention reduces memory waste.
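For a feel of the "memory waste" part, here is a back-of-the-envelope sketch. The sequence lengths, the maximum length, and the block size are all invented numbers, not measurements from either project. It compares reserving a full maximum-length slab per sequence with handing out fixed-size blocks on demand, counting unused token slots.

```python
def contiguous_waste(seq_lens, max_len):
    """Unused slots when every sequence reserves max_len slots up front."""
    return sum(max_len - n for n in seq_lens)

def paged_waste(seq_lens, block_size):
    """Unused slots when memory is handed out in fixed-size blocks on demand.

    Only the tail of each sequence's last block can be empty.
    """
    return sum((-n) % block_size for n in seq_lens)

seq_lens = [10, 300, 57]   # hypothetical current lengths, in tokens
wasted_contiguous = contiguous_waste(seq_lens, max_len=2048)
wasted_paged = paged_waste(seq_lens, block_size=16)
```

With paging, the waste per sequence is bounded by one block minus one slot, regardless of how long the sequence could eventually grow.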
The ideas are orthogonal and can, in theory, be used at the same time.
I believe you could adapt the FlashAttention kernel slightly to implement the same kernel as PagedAttention, since both of them work on the key/value cache at the block level.