Hacker News
by ofirpress 1139 days ago
(I wrote ALiBi) You can read the paper here https://arxiv.org/abs/2108.12409

While intuitively it does seem like ALiBi would make it hard for the model to attend to things that are far away, in many scenarios we've tested with different models trained on different datasets, ALiBi always performs better than sinusoidal, rotary, and other embedding types, even when we're not using it to extrapolate to longer sequence lengths.

These findings have been confirmed by others, including by the BLOOM open source LM project.


Small world!

Thanks for the link (which I've now skimmed beyond the abstract). What wasn't obvious to me from the abstract is that different attention heads have different penalty strengths, so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing. I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.

> so if some prediction task requires long range dependencies you might expect one of the less-penalized heads to end up specializing

Exactly. You have heads that focus on content nearby and ones that focus on stuff that is far away.
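
To make the per-head penalty concrete, here is a minimal sketch in plain Python, not the reference implementation. It assumes the slope schedule described in the paper (a geometric sequence starting at 2^(-8/n) for n heads, with n a power of two) and a causal model. The function names are made up for illustration.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes: geometric sequence starting at 2^(-8/n), same ratio.

    Assumption: num_heads is a power of two.
    """
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]

def alibi_bias(slope, seq_len):
    """Bias added to the attention scores of one head.

    bias[i][j] = -slope * (i - j) for keys at or before the query (j <= i),
    and -inf for future keys (the usual causal mask).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def weight_on_oldest_token(slope, seq_len):
    """Attention the last query gives to token 0 when all content scores are equal (zero)."""
    last_row = alibi_bias(slope, seq_len)[-1]
    return softmax(last_row)[0]

if __name__ == "__main__":
    # Compare how much each of 8 heads can "see" the oldest of 64 tokens.
    for head, slope in enumerate(alibi_slopes(8)):
        print(head, slope, weight_on_oldest_token(slope, 64))
```

With equal content scores, the steepest head puts almost no weight on the oldest token, while the shallowest head stays close to uniform. That is the nearby versus far-away split. Passing a slope of 0.0 gives exactly uniform weights, which corresponds to a head with no penalty at all.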

> I wonder what would happen if the penalty for one head is zero? (The paper suggests this might've been tried and just made things worse, but unclear)

Yup, this is something we tried. Making one of the heads zero doesn't improve or degrade performance.

> I must admit that this is a wonderfully elegant (and interpretable) way to do this... much more intuitive (to me at least, a wannabe practitioner) than all of the trig-based embeddings.

Thanks so much!!