|
|
|
|
|
by danielhanchen
799 days ago
|
|
Oh with Flash Attention, you never have to construct the (S, S) matrix ever (also in article) Since its softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles. In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though. On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100. |
|