by islewis 661 days ago

> Differential attention takes the difference between two softmax attention functions to eliminate attention noise

If I understand correctly, this architecture spends twice as much attention memory in exchange for either a higher-quality model or fewer parameters at similar quality.
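For readers who want the quoted idea in concrete form, here is a minimal pure-Python sketch of one differential attention head: two softmax attention maps are computed and the second, scaled by a constant, is subtracted from the first. The function names and the fixed scalar `lam` are assumptions made for illustration; details such as how the paper learns that scalar or normalizes each head are left out.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    # Scaled dot-product attention map: softmax(Q K^T / sqrt(d)), row by row.
    d = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]

def diff_attention(Q1, K1, Q2, K2, V, lam):
    # Difference of two softmax maps, then a weighted sum over the value rows.
    # lam is a fixed illustrative scalar (an assumption of this sketch).
    A1 = attention_weights(Q1, K1)
    A2 = attention_weights(Q2, K2)
    diff = [[a - lam * b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]
```

With `lam = 0` this reduces to ordinary softmax attention; with identical query/key pairs and `lam = 1` the two maps cancel and the output is all zeros, which is the "noise cancelling" intuition taken to its extreme.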

> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters

This raises a few questions for me:

- Would having only 60% of the parameters offset the doubled attention memory, leaving a memory profile similar to a traditional transformer's?

- Does that tradeoff change noticeably between training and inference?

4 comments

_hl_ 661 days ago

My understanding was that the extra parameters required for the second attention mechanism are included in those 6.8B parameters (i.e. those are the total parameters of the model, not some made-up metric of would-be parameter count in a standard transformer). This makes the result doubly impressive!

Here's the bit from the paper:

> We set the number of heads h = d_model / 2d, where d is equal to the head dimension of Transformer. So we can align the parameter counts and computational complexity.

In other words, they make up for it by having only half as many attention heads per layer.
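The arithmetic behind that can be sketched as follows. This is a rough count of the attention projection weights only, assuming each differential head projects queries, keys, and values to width 2d (two d-wide query/key pairs plus a doubled value, as the next comment describes). The sizes `d_model = 4096` and `d = 128` are hypothetical, not taken from the paper.

```python
def attn_projection_params(d_model, num_heads, q_dim, k_dim, v_dim):
    """Weights in W_Q, W_K, W_V and W_O for one attention layer (biases ignored)."""
    w_q = d_model * num_heads * q_dim
    w_k = d_model * num_heads * k_dim
    w_v = d_model * num_heads * v_dim
    w_o = num_heads * v_dim * d_model
    return w_q + w_k + w_v + w_o

# Hypothetical sizes, chosen only for illustration.
d_model, d = 4096, 128

# Standard Transformer: h = d_model / d heads, each of width d.
standard = attn_projection_params(d_model, d_model // d, d, d, d)

# Differential variant as assumed above: h = d_model / 2d heads, each of width 2d.
differential = attn_projection_params(d_model, d_model // (2 * d), 2 * d, 2 * d, 2 * d)

print(standard, differential)
```

Under these assumptions both layers come out to 4 * d_model^2 weights, since halving the head count exactly cancels doubling the per-head width.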


chessgecko 661 days ago

I think they mitigated the extra memory/compute from this by using half the number of overall heads and doubling V and O. Without actually checking the math I think it should be equivalent in flops, not counting the extra (cheap) multiply by const and subtract.


entropicdrifter 661 days ago

I think it would negate the RAM savings, but it would also reduce the amount of storage needed at rest and possibly reduce initial startup times, depending on storage speed and model size. So, possibly good for low-end models on consumer devices?


Kubuxu 661 days ago

It would double the size of the KV cache, which can be significant (multi-GB) at larger context sizes.
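A back-of-the-envelope estimate helps put numbers on this. The sketch below assumes the cache stores one key and one value vector per head, per layer, per token, in 2-byte values; the layer count, widths, and context length are hypothetical. It compares two readings of the design: keeping the head count fixed while doubling the per-head key/value width, and halving the head count as described in the comments above.

```python
def kv_cache_bytes(num_layers, num_heads, k_dim, v_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence (keys plus values)."""
    return num_layers * num_heads * (k_dim + v_dim) * seq_len * bytes_per_value

# Hypothetical model shape and context length, for illustration only.
layers, d_model, d, seq_len = 32, 4096, 128, 65536

standard = kv_cache_bytes(layers, d_model // d, d, d, seq_len)

# Reading 1: same number of heads, each caching 2d-wide keys and values.
same_heads = kv_cache_bytes(layers, d_model // d, 2 * d, 2 * d, seq_len)

# Reading 2: half as many heads (h = d_model / 2d), each caching 2d-wide keys and values.
half_heads = kv_cache_bytes(layers, d_model // (2 * d), 2 * d, 2 * d, seq_len)

GIB = 1024 ** 3
print(standard / GIB, same_heads / GIB, half_heads / GIB)
```

Under the first reading the cache doubles; under the second it matches the standard transformer. Which one applies depends on how the heads are actually configured.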
