by dnnssl2
926 days ago
Is this still the case for sliding-window attention/streaming LLMs, where you have a fixed-length attention window rather than an ever-growing context whose attention cost scales quadratically as new tokens are passed in? You even get better performance from purposely downsampling non-meaningful attention sink tokens.
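To make the scaling point concrete, here is a rough sketch that only counts query-key pairs; it is not a real attention implementation. The function names and the `window` parameter are made up for illustration, and attention sinks are left out entirely.

```python
def attended_positions(step, window):
    """Positions the token at index `step` attends to under a causal sliding window."""
    start = max(0, step - window + 1)
    return list(range(start, step + 1))


def total_attention_pairs(n_tokens, window=None):
    """Count query-key pairs over a whole sequence.

    window=None models full causal attention (each token sees all earlier ones);
    an integer window caps how many positions each token can see.
    """
    total = 0
    for step in range(n_tokens):
        span = step + 1 if window is None else min(step + 1, window)
        total += span
    return total


if __name__ == "__main__":
    for n in (1_000, 10_000):
        full = total_attention_pairs(n)
        windowed = total_attention_pairs(n, window=256)
        print(f"n={n}: full={full}, window=256 -> {windowed}")
```

With full causal attention the pair count grows roughly as n²/2, while with a fixed window it grows roughly as n times the window size.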
I mean, practically speaking, completions from, say, ChatGPT or Claude take seconds to finish :)