Hacker News new | ask | show | jobs
by kelseyfrog 620 days ago
The values between the groups are also going to diverge during training due to the structure of the DiffAttn equation.

The analogy I can think of is when you're paying attention to a variety of things and you actively avoid concentrating on something because it will distract you. You don't give it zero attention, you give it negative attention.