by ipsum2 1100 days ago
The two ideas are orthogonal and could, in theory, be used at the same time.

I believe you could modify the FlashAttention kernel slightly to implement the same kernel as PagedAttention, since both of them work on the key/value cache at the block level.
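
To make that concrete, here is a rough sketch in plain Python of what the combination might look like for a single decode-step query: the FlashAttention-style running softmax is kept, but each K/V block is fetched through a block table instead of from contiguous memory. This is not code from either project. The names `paged_attention_decode`, `k_pool`, `v_pool`, and `block_table` are made up for illustration, and it assumes one head, one query, and `seq_len >= 1`.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V over contiguous keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d_v)]


def paged_attention_decode(q, k_pool, v_pool, block_table, seq_len, block_size):
    """Blockwise attention for one query over a paged KV cache (hypothetical layout).

    k_pool[p][i] / v_pool[p][i] is the i-th key/value vector in physical block p.
    block_table[n] is the physical block holding logical block n of the sequence.
    """
    scale = 1.0 / math.sqrt(len(q))
    d_v = len(v_pool[0][0])
    m = float("-inf")    # running maximum of the scores seen so far
    l = 0.0              # running sum of exp(score - m)
    acc = [0.0] * d_v    # running unnormalized output

    for logical, physical in enumerate(block_table):
        # The last block may be only partially filled.
        n_valid = min(block_size, seq_len - logical * block_size)
        if n_valid <= 0:
            break
        # The only "paged" part: the block is looked up indirectly.
        k_blk = k_pool[physical][:n_valid]
        v_blk = v_pool[physical][:n_valid]

        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # 0.0 on the first block
        probs = [math.exp(s - m_new) for s in scores]

        # Rescale the old partial results, then fold in this block.
        l = l * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        m = m_new

    return [a / l for a in acc]
```

The inner loop body is the usual blockwise softmax update; the only change from the contiguous case is the `block_table` lookup, which is the sense in which the two ideas compose. A real GPU kernel would also have to deal with memory layout and tiling, which this sketch ignores.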