by tylerneylon
799 days ago
Awesome video. It helps show why the Q*K matrix multiplication is a bottleneck: with sequence (context window) length S, a naive implementation has to store an SxS matrix (the result of all queries times all keys) in memory. One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it: https://learnandburn.ai/p/how-to-build-a-10m-token-context (I edited that article.)
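To put rough numbers on the SxS point, here is a back-of-the-envelope sketch. The function name and the 2-bytes-per-score default (i.e. fp16/bf16) are my own assumptions for illustration, and it counts only the score matrix for a single head, nothing else in the model:

```python
def score_matrix_bytes(seq_len: int, bytes_per_element: int = 2, num_heads: int = 1) -> int:
    """Bytes needed to hold the full seq_len x seq_len attention score matrix.

    Assumes one score per (query, key) pair and bytes_per_element bytes per
    score (2 = fp16/bf16). Hypothetical helper, not from any library.
    """
    return seq_len * seq_len * bytes_per_element * num_heads


if __name__ == "__main__":
    for s in (1_024, 32_768, 1_000_000):
        gib = score_matrix_bytes(s) / 2**30
        print(f"S={s:>9,}: {gib:,.3f} GiB per head")
```

Doubling S quadruples the size, which is why the naive approach stops being practical well before you reach million-token contexts.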
In Unsloth, memory usage scales linearly (not quadratically) with sequence length thanks to Flash Attention (plus you get 2x faster finetuning, 80% less VRAM use, and 2x faster inference). It's still O(N^2) FLOPs, though.
On that note, for long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead, so 228K context on an H100.
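For anyone wondering how memory can be linear while FLOPs stay quadratic, here is a minimal pure-Python sketch of the online-softmax idea behind Flash Attention. To be clear, this is not Unsloth's or FlashAttention's actual kernel: the function names and block size are made up for illustration, it handles one query at a time, and it skips the usual 1/sqrt(d) scaling and masking.

```python
import math


def naive_attention(q, keys, values):
    """Attention for one query, holding all S scores in memory at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but only block_size scores are live at any time.

    Keeps a running max, a running softmax denominator, and a running
    weighted sum of values, rescaling them whenever the max increases.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Every query still gets dotted with every key, so running this over all S queries is S*S dot products (the O(N^2) FLOPs), but the SxS score matrix never exists in memory all at once.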