| HN Mirror



	by thomasahle 1100 days ago
	I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other "memory-aware" attention project I know of. I guess Flash Attention is more about using GPU SRAM efficiently, whereas this is more about using OS/CPU memory better?

2 comments

karmasimida 1100 days ago

I think they are orthogonal.

Flash attention is just another way to compute exact attention.

This work mainly concerns how to resolve memory fragmentation across different sequences.

You still need to compute attention as usual once you retrieve the needed keys and values.
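
To make that concrete, here is a rough sketch in plain Python of what "retrieve, then compute attention as usual" could look like. It is not vLLM's actual code: `BLOCK_SIZE`, `gather`, the pools, and the block table are all hypothetical names and toy values chosen for illustration. The point is only that the attention math is unchanged; the paging affects where the keys and values live.

```python
import math

BLOCK_SIZE = 2  # tokens per physical block (illustrative value)

def gather(pool, block_table, seq_len):
    """Collect one sequence's vectors from non-contiguous physical blocks."""
    out = []
    for block_id in block_table:
        out.extend(pool[block_id])
    return out[:seq_len]  # the last block may be only partly filled

def attention(q, keys, values):
    """Plain exact softmax attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

# Hypothetical pools: physical block id -> per-token vectors in that block.
key_pool = {7: [[1.0, 0.0], [0.0, 1.0]], 3: [[1.0, 1.0], [0.0, 0.0]]}
value_pool = {7: [[1.0, 2.0], [3.0, 4.0]], 3: [[5.0, 6.0], [0.0, 0.0]]}
block_table = [7, 3]  # logical block 0 -> physical 7, logical 1 -> physical 3
seq_len = 3           # only 3 of the 4 reserved slots hold real tokens

q = [1.0, 0.5]
paged_out = attention(q,
                      gather(key_pool, block_table, seq_len),
                      gather(value_pool, block_table, seq_len))

# Same tokens stored contiguously give the same result.
dense_out = attention(q,
                      [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                      [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Here `paged_out` and `dense_out` come out the same, which is the sense in which the paging layer is separate from how attention itself is computed.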


wskwon 1100 days ago

Thanks for the explanation! I believe the two ideas are basically orthogonal. FlashAttention reduces memory reads/writes, while PagedAttention reduces memory waste.
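
A back-of-the-envelope sketch of the "memory waste" part, under assumptions that are mine and not from the thread: a baseline that reserves a contiguous region of `max_len` slots per sequence, versus block-wise allocation with a fixed block size. The sequence lengths, `max_len`, and `block_size` below are made-up illustrative numbers.

```python
import math

def reserved_slots_contiguous(lengths, max_len):
    """Assumed baseline: every sequence reserves max_len slots up front."""
    return max_len * len(lengths)

def reserved_slots_paged(lengths, block_size):
    """Block-wise allocation: each sequence rounds up to whole blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)

lengths = [100, 500, 37]  # illustrative token counts per sequence
used = sum(lengths)

contiguous_waste = reserved_slots_contiguous(lengths, max_len=2048) - used
paged_waste = reserved_slots_paged(lengths, block_size=16) - used

print("unused slots, contiguous reservation:", contiguous_waste)
print("unused slots, block-wise allocation:", paged_waste)
```

With block-wise allocation the unused space per sequence is bounded by one block, whereas the assumed contiguous baseline wastes whatever the sequence does not use of its full reservation.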


ipsum2 1100 days ago

The ideas are orthogonal and can (in theory) be used at the same time.


scv119 1100 days ago

I believe you can slightly modify the flash attention kernel to implement the PagedAttention kernel, since both of them work on the key/value cache at the block level.
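
A minimal sketch of that idea in plain Python, not the real kernel of either project: the loop walks the key/value blocks one at a time and keeps a running (online) softmax, which is the block-level trick FlashAttention-style kernels rely on, while a block table decides which physical block to read next. All names (`blockwise_paged_attention`, the pools, the block table) and the toy values are hypothetical.

```python
import math

def blockwise_paged_attention(q, key_pool, value_pool, block_table,
                              seq_len, block_size):
    """Exact attention for one query, visiting K/V one physical block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(value_pool[block_table[0]][0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max

    for logical_idx, block_id in enumerate(block_table):
        valid = min(block_size, seq_len - logical_idx * block_size)
        if valid <= 0:
            break
        k_block = key_pool[block_id][:valid]
        v_block = value_pool[block_id][:valid]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_block))
               for d, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]

def reference_attention(q, keys, values):
    """Ordinary softmax attention over contiguous keys/values, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

# Hypothetical pools: physical block id -> per-token vectors in that block.
key_pool = {7: [[1.0, 0.0], [0.0, 1.0]], 3: [[1.0, 1.0], [0.0, 0.0]]}
value_pool = {7: [[1.0, 2.0], [3.0, 4.0]], 3: [[5.0, 6.0], [0.0, 0.0]]}

q = [1.0, 0.5]
blockwise_out = blockwise_paged_attention(q, key_pool, value_pool,
                                          block_table=[7, 3],
                                          seq_len=3, block_size=2)
reference_out = reference_attention(q,
                                    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                                    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The two outputs agree up to floating-point rounding, so the only change relative to a contiguous blockwise loop is the indirection through the block table.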
