
> so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing

Exactly. You have heads that focus on content nearby and ones that focus on stuff that is far away.

> I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

Yup, this is something we tried. Setting the penalty for one of the heads to zero neither improves nor degrades performance.
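For readers who want to see the per-head penalty idea concretely, here is a rough sketch rather than the paper's actual code. It assumes each head subtracts `slope * distance` from its attention scores before the softmax, and that the slopes form a geometric sequence. The slope formula, the function names `head_slopes` and `last_query_weights`, and the all-zero raw scores are assumptions made for illustration only.

```python
import math


def head_slopes(num_heads):
    # Assumed for illustration: a geometric sequence of per-head slopes,
    # 2^(-8/n), 2^(-16/n), ..., 2^(-8). Steeper slope = stronger distance penalty.
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def last_query_weights(slope, seq_len):
    # Attention weights of the final query position over all earlier keys.
    # Raw content scores are set to zero (a simplification), so the result
    # shows only the effect of the linear distance penalty.
    query_pos = seq_len - 1
    biased_scores = [0.0 - slope * (query_pos - key_pos) for key_pos in range(seq_len)]
    return softmax(biased_scores)


seq_len = 64
slopes = head_slopes(8)
cases = [
    ("steepest head", slopes[0]),
    ("shallowest head", slopes[-1]),
    ("zero-penalty head", 0.0),
]
for name, slope in cases:
    weights = last_query_weights(slope, seq_len)
    recent_mass = sum(weights[-4:])  # attention mass on the 4 nearest keys
    print(f"{name}: slope={slope:.4f}, mass on nearest 4 keys={recent_mass:.3f}")
```

Under these assumptions, the steep head puts most of its weight on the nearest few tokens, the shallow head spreads its weight much further back, and a zero-penalty head ignores distance entirely.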

> I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.

Thanks so much!!