by aDyslecticCrow 617 days ago
The noise isn't truly random; it's just a matrix of small values that shouldn't be taken into account. Subtracting them cancels them out.

As a different comment pointed out, it's actually the attention we are interested in that gets cancelled out *if the two attention maps are equal*. This is what the paper mentions in its abstract:

> promoting the emergence of sparse attention patterns

In theory, it is quite clever, and their results seem to back it up.
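
To make the cancellation concrete, here is a rough toy sketch, not the paper's implementation. I'm assuming each map is a softmax over scores and that the second map is scaled by a weight `lam` before subtracting; the scores, the function names, and the hand-picked `lam` are all made up for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one row of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_row(scores_a, scores_b, lam=1.0):
    # Subtract one attention row from another, scaled by lam (assumed weight).
    a = softmax(scores_a)
    b = softmax(scores_b)
    return [x - lam * y for x, y in zip(a, b)]

# Hypothetical scores for one query over four keys.
# Map A has a strong peak on key 2 plus a small background.
# Map B carries only the same small background.
scores_a = [0.1, 0.2, 3.0, 0.1]
scores_b = [0.1, 0.2, 0.0, 0.1]

# lam chosen by hand for this toy case so the background roughly cancels.
diff = differential_row(scores_a, scores_b, lam=0.2)

# If both maps are equal, everything cancels, including the peak we care about.
same = differential_row(scores_a, scores_a)

print([round(v, 3) for v in diff])
print(same)
```

With these made-up numbers the background entries end up close to zero while the peak on key 2 survives, which is the sparse pattern the abstract is talking about. The `same` case is the failure mode from the other comment: identical maps subtract to all zeros.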