Hacker News
by miven 616 days ago
Is there an intuitive reason why this ends up working so well compared to, say, thresholding the attention activations that fall below the average for a given head, which should filter out that same attention noise?
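To make the thresholding alternative concrete, here is a rough sketch of what I have in mind. It rests on several assumptions of mine: "below average" means below the mean of one query's softmax weights within a single head, dropped weights are set to zero, and the survivors are renormalized to sum to 1. The helper name `threshold_below_mean` and the toy scores are made up for illustration and do not come from any paper or library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_below_mean(weights):
    """Zero out weights below the row mean, then renormalize.

    Hypothetical helper: `weights` is one query's attention distribution
    within a single head. The largest weight is never below the mean,
    so at least one entry always survives and the sum is nonzero.
    """
    mean = sum(weights) / len(weights)
    kept = [w if w >= mean else 0.0 for w in weights]
    total = sum(kept)
    return [k / total for k in kept]


# Toy scores for one query over five keys: two relevant, three "noise".
scores = [4.0, 0.5, 0.3, 0.2, 3.5]
weights = softmax(scores)
filtered = threshold_below_mean(weights)

print([round(w, 3) for w in weights])
print([round(w, 3) for w in filtered])
```

One thing the sketch makes visible: the cutoff is a hard, non-differentiable mask fixed by the mean, so the model gets no say in how much gets filtered.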