by nowayno583
614 days ago
Does anyone understand why they are taking the difference between transformers instead of the sum? It seems to me that in a noise-reducing solution we would be more interested in the sum, as random noise would cancel out and the signal would add constructively. Of course, even if I'm right, proper training would account for that by inverting signs where appropriate. Still, it seems weird to present it as the difference, especially since they compare this directly to noise-cancelling headphones, where we sum both microphone inputs.

As a different comment pointed out, it's actually the attention itself that gets cancelled out *if both are equal*. This is what the paper mentions in its abstract:
> promoting the emergence of sparse attention patterns
In theory, it is quite clever, and their results seem to back it up.
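
To make the cancellation concrete, here is a toy sketch, not the paper's implementation. The attention weights are made up, `combine` is a hypothetical helper, and `lam` is a fixed value I picked for illustration (as I understand it, the paper learns this weight). It only shows that wherever the two maps agree the difference goes to zero, while the sum keeps the shared background.

```python
def combine(attn_a, attn_b, lam=1.0, sign=-1.0):
    """Return attn_a + sign * lam * attn_b, element by element."""
    return [a + sign * lam * b for a, b in zip(attn_a, attn_b)]


# Made-up attention weights for one query over four keys (each row sums to 1).
attn_a = [0.2, 0.2, 0.4, 0.2]      # shared background plus some focus on key 2
attn_b = [0.25, 0.25, 0.25, 0.25]  # background only

lam = 0.8  # hypothetical fixed weight, chosen so the background matches

diff = combine(attn_a, attn_b, lam, sign=-1.0)    # difference of the two maps
summed = combine(attn_a, attn_b, lam, sign=+1.0)  # sum of the two maps
same = combine(attn_a, attn_a, 1.0, sign=-1.0)    # identical maps, subtracted

print("difference:", diff)
print("sum:       ", summed)
print("identical maps, difference:", same)
```

In the difference only key 2 survives, which is the sparse pattern the abstract is talking about. In the sum every key keeps a sizeable weight, so the shared background is reinforced rather than removed.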