by islewis
613 days ago
> Differential attention takes the difference between two softmax attention functions to eliminate attention noise

If I understand correctly, this architecture spends twice as much attention memory in exchange for either a higher quality model, or fewer parameters at a similar quality.

> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters

This raises a few questions for me:

- Would having only 60% of the parameters negate the double space for attention, leaving a memory profile similar to a traditional transformer?
- Does that tradeoff change noticeably between training and inference?
|
Here's the bit from the paper:
> We set the number of heads h = d_model / 2d, where d is equal to the head dimension of Transformer. So we can align the parameter counts and computational complexity.
In other words, they make up for it by having only half as many attention heads per layer.
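
To make the head-count arithmetic concrete, here is a rough sketch in Python. It is not from the paper's code. It assumes each differential head projects queries, keys, and values to width 2d (the query and key halves feeding the two softmax maps), and it ignores the small extra parameters such as the learned scalar and per-head normalization. The function names and the example sizes (d_model = 1024, d = 64) are made up for illustration.

```python
def standard_attention(d_model: int, d: int) -> dict:
    """Counts for ordinary multi-head attention with head dimension d."""
    h = d_model // d                       # number of heads
    proj_params = 3 * d_model * (h * d)    # W_Q, W_K, W_V
    out_params = (h * d) * d_model         # W_O
    return {"heads": h, "params": proj_params + out_params, "softmax_maps": h}


def diff_attention(d_model: int, d: int) -> dict:
    """Counts for differential attention with h = d_model / 2d heads.

    Assumption: each head uses Q, K, V projections of width 2d, and the
    Q/K halves produce two softmax maps that are subtracted.
    """
    h = d_model // (2 * d)                     # half as many heads
    proj_params = 3 * d_model * (h * 2 * d)    # W_Q, W_K, W_V
    out_params = (h * 2 * d) * d_model         # W_O
    return {"heads": h, "params": proj_params + out_params, "softmax_maps": 2 * h}


if __name__ == "__main__":
    d_model, d = 1024, 64  # hypothetical sizes
    print("standard:", standard_attention(d_model, d))
    print("diff:    ", diff_attention(d_model, d))
```

Under those assumptions, both variants end up with the same projection parameter count and the same number of softmax maps per layer, which is how halving the heads offsets the two maps per head.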