by WithinReason
623 days ago
Don't look for an analogy; this just adds a new mathematical capability. It enables "negative attention": the network can say "I want to subtract the contribution of this token" in the attention calculation. Previously it could only reduce how much of a token it adds. The simple way of doing this would be to just remove the softmax or use a sigmoid instead, but in practice a softmax seems to work better.
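
To make the point concrete, here is a minimal sketch in plain Python. It assumes that "negative attention" is obtained by subtracting one softmax attention map from another; the comment itself does not spell out the mechanism, so treat that as an assumption. The names `scores_a`, `scores_b`, and `lam` are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax: every output is >= 0 and they sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(weights, values):
    # Weighted sum of (scalar) token values.
    return sum(w * v for w, v in zip(weights, values))

values = [1.0, 2.0, 3.0]      # one scalar "value" per token
scores_a = [2.0, 0.0, 0.0]    # hypothetical scores from a first query/key pair
scores_b = [0.0, 2.0, 0.0]    # hypothetical scores from a second query/key pair
lam = 1.0                     # hypothetical scaling of the subtracted map

# Ordinary softmax attention: a token's weight can shrink toward 0,
# but it can never go below 0.
plain = softmax(scores_a)

# Assumed "negative attention": the difference of two softmax maps.
# A token the second map attends to more strongly gets a negative weight,
# so its value is subtracted from the output rather than merely down-weighted.
diff = [a - lam * b for a, b in zip(softmax(scores_a), softmax(scores_b))]

print("plain weights:", plain, "output:", attend(plain, values))
print("diff weights: ", diff, "output:", attend(diff, values))
```

In this sketch the second token ends up with a negative weight in `diff`, which a single softmax cannot produce.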