by kristjansson 853 days ago

FlashAttention(2)[0] reduces attention's memory footprint from quadratic to linear in context length. Compute is still O(n^2) in sequence length though, AFAIK, so we'd expect sequences this long to take some time to compute.
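To put rough numbers on that, here is a back-of-the-envelope sketch, not anything taken from the paper's code. The function name, the head dimension of 128, and the 2-byte (fp16) element size are all assumptions picked for illustration; it counts a single head's forward pass and ignores causal masking and softmax cost.

```python
def attention_costs(n, d=128, bytes_per_elem=2):
    """Rough per-head forward costs for exact attention over n tokens.

    d (head dimension) and bytes_per_elem (fp16) are illustrative assumptions.
    """
    # Naive attention materializes the full n x n score matrix: O(n^2) memory.
    score_matrix_bytes = n * n * bytes_per_elem
    # A tiled kernel only needs Q, K, V and the output in main memory: O(n).
    qkvo_bytes = 4 * n * d * bytes_per_elem
    # QK^T and (softmax scores) @ V are each about 2*n*n*d FLOPs: still O(n^2).
    flops = 4 * n * n * d
    return score_matrix_bytes, qkvo_bytes, flops


if __name__ == "__main__":
    for n in (8_192, 65_536, 1_048_576):
        scores, qkvo, flops = attention_costs(n)
        print(f"n={n:>9,}  scores={scores / 2**30:10.2f} GiB  "
              f"qkvo={qkvo / 2**30:8.4f} GiB  flops={flops:.3e}")
```

Doubling n doubles the Q/K/V/O footprint but quadruples both the score-matrix size and the FLOP count, which is the gap between the memory win and the unchanged compute cost.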

I'm a bit out of my depth, but I think ultra-long exact-attention work like this also probably has to answer some questions about where to put the KV-cache before it can be used in practice?
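For a sense of scale on the KV-cache question, a quick estimate, again only a sketch: the layer count, KV head count, head dimension, and fp16 storage below are hypothetical values chosen for illustration, not the configuration of any model discussed here.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Estimate KV-cache size for one sequence.

    All defaults are hypothetical, illustrative model dimensions.
    """
    # Factor of 2: one key vector and one value vector per token,
    # per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    for n in (4_096, 131_072, 1_000_000):
        print(f"seq_len={n:>9,}  kv_cache={kv_cache_bytes(n) / 1e9:8.1f} GB")
```

Under those assumptions the cache costs about half a MiB per token, so a million-token context would need roughly half a terabyte for a single sequence, far beyond one accelerator's memory. That is the "where to put it" problem: the cache grows linearly, but with a large constant.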

[0]: https://arxiv.org/abs/2205.14135