by thomasahle
1100 days ago
I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other "memory-aware" attention project I'm aware of. I guess Flash Attention is more about using GPU SRAM efficiently, whereas this is more about using OS/CPU memory better?
Flash Attention is just another way to compute exact attention.
This work mainly concerns how to resolve memory fragmentation across different sequences.
You still need to compute attention as usual once you retrieve the needed keys and values.
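
To make the distinction concrete, here is a toy sketch in plain Python. It is not the project's actual code: `PagedKVCache`, `BLOCK_SIZE`, and the block-table layout are hypothetical names I made up for illustration. The assumption is that the cache hands out fixed-size blocks to whichever sequence needs one, so a sequence's keys and values end up in non-adjacent blocks, and a per-sequence block table is used to find them again. The attention step itself is ordinary exact softmax attention.

```python
import math

BLOCK_SIZE = 2  # tokens per physical block (hypothetical value)


class PagedKVCache:
    """Toy paged KV cache: fixed-size blocks shared by all sequences."""

    def __init__(self, num_blocks):
        self.keys = [[None] * BLOCK_SIZE for _ in range(num_blocks)]
        self.values = [[None] * BLOCK_SIZE for _ in range(num_blocks)]
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append(self, seq_id, key, value):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # No block yet, or the last one is full: take any free block.
            table.append(self.free.pop(0))
        block = table[n // BLOCK_SIZE]
        self.keys[block][n % BLOCK_SIZE] = key
        self.values[block][n % BLOCK_SIZE] = value
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Follow the block table to recover keys/values in token order."""
        ks, vs = [], []
        table = self.block_tables[seq_id]
        for i in range(self.lengths[seq_id]):
            block = table[i // BLOCK_SIZE]
            ks.append(self.keys[block][i % BLOCK_SIZE])
            vs.append(self.values[block][i % BLOCK_SIZE])
        return ks, vs


def attention(query, keys, values):
    """Plain exact softmax attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


# Two sequences appended in interleaved order, so "a" gets non-adjacent blocks.
seq_a = [([1.0, 0.0], [1.0, 2.0]),
         ([0.0, 1.0], [3.0, 4.0]),
         ([1.0, 1.0], [5.0, 6.0])]
seq_b = [([0.5, 0.5], [7.0, 8.0])]

cache = PagedKVCache(num_blocks=4)
cache.append("a", *seq_a[0])
cache.append("b", *seq_b[0])
cache.append("a", *seq_a[1])
cache.append("a", *seq_a[2])

query = [1.0, 0.5]
paged_keys, paged_values = cache.gather("a")
paged_out = attention(query, paged_keys, paged_values)

# Reference: the same attention over a contiguous copy of sequence "a".
contiguous_out = attention(query,
                           [k for k, _ in seq_a],
                           [v for _, v in seq_a])
```

In this sketch the paging only changes where the keys and values live; `attention` is the same function either way, which is where something like Flash Attention could slot in as a faster way to compute that same result.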