by newhouseb
1139 days ago
Small world! Thanks for the link (which I've now skimmed beyond the abstract). What wasn't obvious to me from the abstract is that different attention heads have different penalty strengths, so if some prediction task requires long-range dependencies, you might expect one of the less-penalized heads to end up specializing in them.

I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.
> I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

Yup, this is something we tried. Setting the penalty for one of the heads to zero neither improves nor degrades performance.

> I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.

Thanks so much!!
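
For anyone following along, here is a rough sketch of the per-head penalty idea discussed above. It assumes the penalty is a head-specific slope multiplied by the query-key distance and subtracted from the attention scores before the softmax. The geometric slope schedule, the function names `head_slopes` and `biased_attention`, and the `zero_last` flag are all illustrative assumptions, not something taken from the thread or confirmed as the paper's implementation.

```python
import math


def head_slopes(num_heads, zero_last=False):
    """One penalty strength per head (assumed geometric schedule)."""
    slopes = [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    if zero_last:
        # Hypothetical variant from the question above: one unpenalized head.
        slopes[-1] = 0.0
    return slopes


def biased_attention(scores, slope):
    """Causal softmax over scores[i][j] - slope * (i - j) for keys j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Keys further in the past (larger i - j) receive a larger penalty.
        logits = [row[j] - slope * (i - j) for j in range(i + 1)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        # Future positions (j > i) are masked out with zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


if __name__ == "__main__":
    n = 6
    flat_scores = [[0.0] * n for _ in range(n)]  # content-free scores
    for slope in head_slopes(4, zero_last=True):
        last_row = biased_attention(flat_scores, slope)[-1]
        print(f"slope={slope:.4f}", [round(w, 3) for w in last_row])
```

With flat scores, a steeply penalized head concentrates its weight on the most recent tokens, while a head with slope zero spreads its weight uniformly over the whole context, which is the kind of head you might expect to pick up long-range dependencies.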