by sidkshatriya 492 days ago
This is a paper by DeepSeek. It would be a good idea to mention that in the title.

TL;DR: This is a very interesting paper about attention calculation in transformers. It shows how attention can be calculated over a large token window without saturating memory or GPU compute.

Usually attention is computed over a window of tokens. Because the cost of attention grows quadratically with the length of the window, the window can turn out to be too big to compute over. There are many papers on how to get some of the benefits of transformers by doing "sparse attention" -- i.e. avoiding some of the quadratic blowup.
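To put a number on the quadratic blowup, here is a small sketch that counts how many (query, key) score computations causal attention needs. The function name `causal_pairs` is just something I made up for illustration; it is not from the paper.

```python
def causal_pairs(n_tokens: int) -> int:
    """Count (query, key) pairs when every token attends to itself
    and to all tokens before it (standard causal attention)."""
    return n_tokens * (n_tokens + 1) // 2


# Doubling the window roughly quadruples the number of scores to compute.
for n in (1_000, 2_000, 4_000):
    print(n, causal_pairs(n))
```

Going from 1,000 to 2,000 tokens takes the count from 500,500 to 2,001,000, which is the blowup sparse attention tries to avoid.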

The solution in the paper is to first divide the tokens into groups of `x` tokens, or "blocks", and then:

(1) Capture long range connections by compressing blocks of tokens to a single token

(2) Select important tokens by only choosing the tokens in the "important" blocks

(3) Select recent tokens by using a sliding window (like normal transformers)

Compression of a block of tokens to a single token in (1) is done by an MLP that is trained during normal training time.

Now attention scores can be computed between an incoming token and the compressed tokens of the preceding blocks. For (2), select only the top-k blocks whose compressed tokens from (1) have high attention scores.
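A rough sketch of the block scoring and top-k selection, in plain Python. Big caveat: I use mean pooling as a stand-in for the learned compression MLP, and the names `mean_pool` and `select_top_k_blocks` are my own, so treat this as an illustration of the idea and not the paper's implementation.

```python
import math


def mean_pool(block):
    """Stand-in for the learned compression MLP (an assumption):
    average the vectors of a block into one vector."""
    dim = len(block[0])
    return [sum(vec[i] for vec in block) / len(block) for i in range(dim)]


def select_top_k_blocks(query, keys, block_size, k):
    """Return the indices of the k blocks whose compressed key
    scores highest against `query`, in ascending block order."""
    # Split the preceding keys into complete blocks; a trailing
    # partial block is simply dropped in this sketch.
    starts = range(0, len(keys) - block_size + 1, block_size)
    compressed = [mean_pool(keys[s:s + block_size]) for s in starts]
    if not compressed:
        return []
    # Scaled dot-product score of the query against each compressed key.
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, ck)) / scale for ck in compressed]
    # Softmax is monotonic, so ranking raw scores gives the same top-k
    # as ranking the attention weights.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0],    # block 0
        [2.0, 0.0], [2.0, 0.0],    # block 1
        [-1.0, 0.0], [-1.0, 0.0],  # block 2
        [1.0, 0.0], [1.0, 0.0]]    # block 3
print(select_top_k_blocks(query, keys, block_size=2, k=2))
```

Only the tokens inside the returned blocks would then take part in the fine-grained attention of (2).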

Finally, combine the results of the incoming token's attention over (1), (2) and (3) to give you the final output for that token. You get long range coarse attention, attention to selective blocks and the usual sliding window attention. Awesome!
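Putting the three branches together for a single incoming token might look roughly like the sketch below. Assumptions, all mine: mean pooling replaces the learned compression MLP, the branches are mixed with fixed hypothetical `gates` weights, and `keys`/`values` hold only the tokens preceding the query. The helper names are made up for illustration.

```python
import math


def attend(query, keys, values):
    """Plain softmax attention of one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]


def mean_pool(block):
    """Stand-in for the learned compression MLP (an assumption)."""
    dim = len(block[0])
    return [sum(v[i] for v in block) / len(block) for i in range(dim)]


def sparse_attention(query, keys, values, block_size, top_k, window,
                     gates=(1 / 3, 1 / 3, 1 / 3)):
    """Combine coarse, selected-block and sliding-window attention."""
    if len(keys) < block_size or top_k < 1 or window < 1:
        raise ValueError("need at least one full block, top_k >= 1, window >= 1")
    starts = range(0, len(keys) - block_size + 1, block_size)

    # (1) Coarse branch: attend over one compressed token per block.
    ck = [mean_pool(keys[s:s + block_size]) for s in starts]
    cv = [mean_pool(values[s:s + block_size]) for s in starts]
    coarse = attend(query, ck, cv)

    # (2) Selected branch: keep the top-k blocks by compressed score,
    # then attend over the original tokens inside those blocks.
    block_scores = [sum(q * c for q, c in zip(query, key)) for key in ck]
    best = sorted(range(len(ck)), key=lambda i: block_scores[i], reverse=True)[:top_k]
    idx = [t for b in sorted(best) for t in range(b * block_size, (b + 1) * block_size)]
    selected = attend(query, [keys[t] for t in idx], [values[t] for t in idx])

    # (3) Local branch: attend over the most recent `window` tokens.
    local = attend(query, keys[-window:], values[-window:])

    # Weighted sum of the three branch outputs (fixed hypothetical gates).
    return [gates[0] * c + gates[1] * s + gates[2] * l
            for c, s, l in zip(coarse, selected, local)]
```

The point of the structure is that the query only ever scores one compressed token per block, plus `top_k * block_size` selected tokens, plus `window` recent tokens, instead of every preceding token.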

This is also something of an engineering paper, with lots of low-level details.

Question for the authors: Why not also run the experiments with MLA (multi-head latent attention), which is used in DeepSeek V3 and R1?