by tylerneylon 799 days ago
Awesome video. This helps to show why the Q*K matrix multiplication is a bottleneck: with sequence (context window) length S, you need to store an S x S matrix (the result of all queries times all keys) in memory.
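To put rough numbers on that, here is a back-of-the-envelope sketch. The helper name `score_matrix_bytes` and the 2-bytes-per-element (fp16/bf16) default are assumptions for illustration, and the figure is for a single attention head in a single layer:

```python
def score_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one full (S, S) attention score matrix.

    bytes_per_element=2 assumes fp16/bf16 storage (an illustrative choice).
    """
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    # Memory grows with the square of the sequence length.
    for s in (1_000, 10_000, 100_000):
        gib = score_matrix_bytes(s) / 2**30
        print(f"S={s:>7,}: {gib:10.3f} GiB per head")
```

At S = 100,000 that is 2 x 10^10 bytes, about 20 GB for one head, which is why materializing the full matrix stops being practical at long context lengths.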

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

https://learnandburn.ai/p/how-to-build-a-10m-token-context

(I edited that article.)

2 comments

Oh, with Flash Attention you never have to construct the (S, S) matrix at all (this is also in the article). Since it's softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles.
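A minimal NumPy sketch of that tiling idea, assuming a single head with no masking. This is not the actual Flash Attention kernel (which also tiles the queries and runs fused on the GPU); it only shows the running-softmax trick that lets you process keys and values one block at a time. The function names and the `block` parameter are made up for illustration:

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference version: materializes the full (S, S) score matrix."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, block=64):
    """Same result, but only ever holds an (S, block) slice of the scores."""
    s, d = q.shape
    out = np.zeros((s, v.shape[-1]))   # unnormalized output accumulator
    row_max = np.full((s, 1), -np.inf) # running max of scores per query
    row_sum = np.zeros((s, 1))         # running softmax denominator per query

    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        scores = q @ k_blk.T / np.sqrt(d)  # (S, block) tile of the scores

        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        # Rescale what was accumulated so far to the new running max.
        # On the first block this is exp(-inf) = 0, which is harmless
        # because the accumulators are still zero.
        scale = np.exp(row_max - new_max)
        p = np.exp(scores - new_max)

        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v_blk
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 200, 16))
    print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

Peak memory for the scores here is S x block instead of S x S, while the number of multiply-adds is unchanged, so the work is still quadratic in S.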

In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though.

On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100.

He lists Ring Attention and half a dozen other techniques, but they're not within the scope of this video: https://youtu.be/eMlx5fFNoYc?t=784