|
|
|
|
|
by qz94
612 days ago
|
|
It does sound like we're hindering the model a bit by allowing negative weights to exist instead of sending them through, say, a ReLU. But, dealing with this might be an easier problem than you think for the model. In the first diagram with the attention weights, there actually are some negative scores in the noise section. But, the attention to that section is very small anyway. All the second attention map needs to do is predict the noise in the first one -- a task that can be done very accurately, because it has full access to the input of the first. To refer back to their real-world comparison, noise-canceling headphones have access to what your ear hears through a microphone, so they can output exactly the right cancellation signal. Similarly, the second attention map knows what's being input into the first one, so it can output a corresponding cancellation signal. It's not perfect -- just as noise-canceling headphones aren't perfect -- but it still gets you 99% of the way there, which is enough to boost performance. |
|