by aDyslecticCrow
617 days ago
The noise isn't truly random; it's just a matrix of small values that shouldn't be taken into account. Subtracting one attention map from the other cancels those values out. As pointed out in a different comment, it's actually the attention we are interested in that gets cancelled out *if both maps are equal*. This is what the paper mentions in its abstract:

> promoting the emergence of sparse attention patterns

In theory, it is quite clever, and their results seem to back it up.
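To make the cancellation concrete, here is a rough toy sketch, not the paper's implementation. The attention weights are hand-picked, and the function name `differential_weights` and the scalar `lam` are assumptions for illustration only.

```python
def differential_weights(map_a, map_b, lam):
    """Row-wise difference of two attention maps: map_a - lam * map_b."""
    return [
        [a - lam * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


# Hand-picked attention weights for one query over four keys.
# Both maps put small "noise" weights on keys 1..3 and a peak on key 0.
map_a = [[0.70, 0.10, 0.10, 0.10]]
map_b = [[0.40, 0.20, 0.20, 0.20]]

# With a suitable lam (hypothetical value), the small shared values cancel
# and only the peak on key 0 survives, giving a sparse pattern.
sparse = differential_weights(map_a, map_b, lam=0.5)

# If both maps are equal and lam is 1, everything cancels,
# including the attention we actually care about.
cancelled = differential_weights(map_a, map_a, lam=1.0)

print(sparse)
print(cancelled)
```

The second call is the case the other comment pointed out: the subtraction only helps when the two maps differ on the key that matters and agree on the rest.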