Hacker News new | ask | show | jobs
by danielhanchen 799 days ago
Oh with Flash Attention, you never have to construct the (S, S) matrix ever (also in article) Since its softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles.

In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though.

On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100.