by WithinReason 623 days ago
Don't look for an analogy; this just adds a new mathematical capability. It enables "negative attention": the network can say "I want to subtract the contribution of this token" in the attention calculation. Previously it could only reduce how much of a token it adds, since softmax weights are never negative.

The simple way of doing this would be to just remove the softmax or use a sigmoid instead, but in practice a softmax seems to work better.
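
To make that concrete, here's a toy sketch in plain Python. Everything in it is made up for illustration: the numbers, the helper names, and especially the "difference of two softmax maps" variant with its `lam` scale, which is just one assumed way to get signed weights while keeping a softmax, not necessarily what the thing under discussion does.

```python
import math

def softmax(scores):
    # Numerically stable softmax: outputs are all > 0 and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, values):
    # Attention output for one query over scalar values: a weighted sum.
    return sum(w * v for w, v in zip(weights, values))

# Toy attention scores for one query over three tokens (made-up numbers).
scores = [2.0, -3.0, 0.5]
values = [1.0, 10.0, -2.0]

# Standard attention: token 1 gets a tiny weight, but never a negative one.
soft_w = softmax(scores)

# "Just remove the softmax": raw scores are signed, so token 1 is subtracted.
raw_w = scores

# Assumed signed variant: subtract a second softmax map, scaled by lam.
# scores_b and lam are hypothetical; in a real model they would be learned.
scores_b = [0.0, 3.0, 0.0]
lam = 0.5
diff_w = [a - lam * b for a, b in zip(softmax(scores), softmax(scores_b))]

print("softmax weights:", soft_w, "->", mix(soft_w, values))
print("raw weights:    ", raw_w, "->", mix(raw_w, values))
print("diff weights:   ", diff_w, "->", mix(diff_w, values))
```

With plain softmax the weight on the second token can get close to zero but stays positive. In the other two cases it goes below zero, so that token's value is actively subtracted from the output.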