Hacker News
by miven, 616 days ago
Is there an intuitive reason why this ends up working this well compared to, say, applying some kind of thresholding to attention activations that are below average for a given head to filter that same attention noise out?
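To make the alternative concrete, here is a rough sketch of the kind of thresholding I have in mind, for a single query row of one head. The names `softmax` and `threshold_below_mean` are made up for illustration, and renormalizing the surviving weights is my own assumption; none of this is taken from a library or from the method under discussion.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one row of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_below_mean(weights):
    # Hypothetical filter: zero out weights below the row average,
    # then renormalize so the survivors sum to 1 again.
    # The largest weight is never below the mean, so total > 0.
    mean = sum(weights) / len(weights)
    kept = [w if w >= mean else 0.0 for w in weights]
    total = sum(kept)
    return [w / total for w in kept]

scores = [4.0, 0.5, 0.3, 0.1]   # one strong match, three weak ones
weights = softmax(scores)
filtered = threshold_below_mean(weights)
print(weights)
print(filtered)
```

Since softmax weights over n keys always average 1/n, "below average" here just means "below 1/n". It is a hard, non-differentiable cut, which may be part of the answer to my own question.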