|
|
|
|
|
by kelseyfrog
620 days ago
|
|
The values between the groups are also going to diverge during training due to the structure of the DiffAttn equation. The analogy I can think of is when you're paying attention to a variety of things and you actively avoid concentrating on something because it will distract you. You don't give it zero attention, you give it negative attention. |
|