
by tylerneylon 799 days ago

Awesome video. This helps to show why the Q*K matrix multiplication is a bottleneck: if your sequence (context window) length is S, then a straightforward implementation needs to store an S x S matrix (the result of all queries times all keys) in memory.
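To put rough numbers on that, here is a back-of-the-envelope sketch. The function name and the 2-bytes-per-element default (fp16/bf16) are my own assumptions for illustration, and the figure is for one attention head on one sequence; real implementations vary.

```python
def attention_scores_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to hold one full S x S score matrix.

    Assumes one head, one sequence, and 2 bytes per element (fp16/bf16)
    unless told otherwise. Hypothetical helper, not from any library.
    """
    return seq_len * seq_len * bytes_per_element


# Memory grows with the square of the context length.
for s in (1_000, 10_000, 100_000):
    print(s, attention_scores_bytes(s) / 1e9, "GB")
```

Under those assumptions, S = 100,000 already needs 20 GB for a single head's score matrix, and a 10x longer context costs 100x more.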

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

https://learnandburn.ai/p/how-to-build-a-10m-token-context

(I edited that article.)

2 comments

danielhanchen 799 days ago

Oh, with Flash Attention you never have to construct the (S, S) matrix at all (this is also in the article). Since it's softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles.
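A minimal sketch of the tiling idea, in plain Python so it runs without dependencies. This is not the actual Flash Attention kernel (which also tiles over queries and runs as a fused GPU kernel); it only illustrates the running-max / running-sum softmax trick that makes tiling possible. The function names and the tile size are assumptions for illustration.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: computes a full row of S scores for each query."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, tile=2):
    """Visits keys/values one tile at a time, keeping only running statistics."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * dv          # unnormalized weighted sum of value rows
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in k_tile]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max (0.0 on first tile).
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_tile))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions should agree up to floating-point error, but the tiled one only ever holds `tile` scores per query at a time, which is why memory can scale linearly in S while the FLOPs stay quadratic.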

In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though.

On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100.

rahimnathwani 799 days ago

He lists Ring Attention and half a dozen other techniques, but they're not within the scope of this video: https://youtu.be/eMlx5fFNoYc?t=784