by kristjansson 853 days ago
FlashAttention(2)[0] reduces attention's memory use from quadratic to linear in context length. Compute is still O(n^2) in length though, AFAIK, so we'd expect these long sequence lengths to take some time to compute.
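
To make the memory/compute split concrete, here's a rough pure-Python sketch of the streaming-softmax idea that, as I understand it, underlies this kind of exact attention. It is not the actual FlashAttention kernel (that one tiles queries too and is built around GPU SRAM); the function names, the `block` parameter, and the `pair_count` counter are all made up for illustration.

```python
import math


def naive_attention(q, k, v):
    """Reference version: materializes the full n x n score matrix."""
    d = len(q[0])
    # One score per (query, key) pair, all held in memory at once.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
              for qi in q]
    out = []
    for row in scores:
        m = max(row)
        weights = [math.exp(s - m) for s in row]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def streaming_attention(q, k, v, block=4):
    """Same result, but only `block` scores per query are alive at a time.

    Returns (output, pair_count), where pair_count is the number of
    query-key dot products computed (hypothetical counter for illustration).
    """
    d = len(q[0])
    dv = len(v[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    pair_count = 0
    for qi in q:
        running_max = -math.inf   # largest score seen so far for this query
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * dv          # unnormalized weighted sum of values
        for start in range(0, len(k), block):
            kb = k[start:start + block]
            vb = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in kb]
            pair_count += len(scores)
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the old max.
            correction = math.exp(running_max - new_max)
            probs = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(probs)
            acc = [a * correction + sum(p * vj[c] for p, vj in zip(probs, vb))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out, pair_count
```

The working state per query is a fixed-size accumulator plus one block of scores, so extra memory doesn't grow with n^2, but `pair_count` still comes out to n * n: every query has to look at every key.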

I'm a bit out of my depth, but I think ultra-long exact-attention work like this also has to answer some questions about where to put the KV-cache, which grows linearly with context length, before it can be used in practice?
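
For a sense of scale, here's a back-of-the-envelope calculation. The model shape (32 layers, 32 KV heads, head dim 128, fp16) is a hypothetical config I picked for illustration, not anything taken from the work being discussed, and the helper name is made up.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes to cache keys and values for one sequence.

    The factor of 2 covers one key vector and one value vector per token,
    per layer, per KV head. Assumes no quantization or cache compression.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(1, 32, 32, 128, 2)
million_ctx = kv_cache_bytes(2**20, 32, 32, 128, 2)
print(per_token / 2**20, "MiB per token")
print(million_ctx / 2**30, "GiB at 2**20 tokens")
```

Under those assumptions that's half a MiB per token, so around a million tokens of context means hundreds of GiB of cache for a single sequence, well past one accelerator's memory. Fewer KV heads or lower precision would shrink it, but the linear growth stays.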

[0]: https://arxiv.org/abs/2205.14135