

by danielhanchen 799 days ago

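Oh, with Flash Attention you never have to construct the (S, S) matrix at all (this is also in the article). Since it's softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles: keep a running row max and row sum for the softmax, and rescale the partial output as each new K/V tile comes in.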
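To make the tiling idea concrete, here's a rough NumPy sketch (single head, no masking, no batching). It is not Unsloth's or Flash Attention's actual kernel, just the same math on the CPU. The names `naive_attention`, `tiled_attention` and the `block` parameter are made up for illustration.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference version: materializes the full (S, S) score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (S, S)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block=64):
    """Same result, but only ever holds a (block, block) tile of scores."""
    S, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((S, V.shape[1]))
    for qs in range(0, S, block):
        q = Q[qs:qs + block] * scale
        rows = q.shape[0]
        m = np.full(rows, -np.inf)            # running row max
        l = np.zeros(rows)                    # running softmax denominator
        acc = np.zeros((rows, V.shape[1]))    # running unnormalized output
        for ks in range(0, K.shape[0], block):
            s = q @ K[ks:ks + block].T        # one tile of scores
            m_new = np.maximum(m, s.max(axis=-1))
            correction = np.exp(m - m_new)    # rescales earlier partial sums
            p = np.exp(s - m_new[:, None])
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out
```

Extra memory is O(S * d) for the output plus one tile, while the two nested loops still touch every (query, key) pair, so the FLOPs stay quadratic.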

In Unsloth, memory usage scales linearly (not quadratically) with sequence length thanks to Flash Attention (plus you get 2x faster finetuning, 80% less VRAM use and 2x faster inference). It's still O(N^2) FLOPs though.

On that note, for long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So that's a 228K context on an H100.
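For a sense of scale, here's some back-of-envelope arithmetic on what the (S, S) matrix would cost if you did materialize it. Assumptions (mine, not from the release notes): 228K means 228,000 tokens, scores are stored in 16-bit (2 bytes each), and this counts one head of one sequence. The helper names are hypothetical.

```python
def score_matrix_bytes(seq_len, bytes_per_elem=2):
    """Bytes needed to hold one full (S, S) score matrix."""
    return seq_len * seq_len * bytes_per_elem

def tile_bytes(block=128, bytes_per_elem=2):
    """Bytes needed for a single (block, block) tile of scores."""
    return block * block * bytes_per_elem

full = score_matrix_bytes(228_000)   # 103_968_000_000 bytes, about 104 GB
tile = tile_bytes(128)               # 32_768 bytes, i.e. 32 KiB
print(f"full matrix: {full / 1e9:.1f} GB, one tile: {tile / 1024:.0f} KiB")
```

So a single head's full score matrix at that length would already exceed an 80 GB H100, which is why never building it matters so much.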