by huevosabio 990 days ago
This seems to be largely enabled by the observation that Softmax has to add up to one. From a quick glance [1], the model tends to use the first token as a placeholder for cases when it doesn't need to attend to any of the prior tokens.

The first time I read about this issue, that Softmax is somewhat flawed, was in an HN post by Evan Miller [2], where he observes that forcing attention heads to allocate all of their attention to prior tokens is wrong, and that we should allow them to "not attend" by adding one to the softmax denominator.
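To make the "add one to the denominator" idea concrete, here is a minimal pure-Python sketch. The function names `softmax` and `softmax_plus_one` and the example scores are my own, chosen for illustration; they do not come from the linked post or the paper.

```python
import math

def softmax(scores):
    """Standard softmax: the weights always sum to exactly 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(scores):
    """Softmax with an extra 1 in the denominator:
    exp(s_i) / (1 + sum_j exp(s_j)).
    The weights can now sum to less than 1."""
    # Treat the "+1" as an implicit logit of 0 when choosing the shift.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a query that matches none of the prior tokens well.
weak = [-6.0, -7.0, -5.5]
print(sum(softmax(weak)))           # forced to spread a full unit of attention
print(sum(softmax_plus_one(weak)))  # free to attend to almost nothing

# Hypothetical scores with one strong match: the two variants nearly agree.
strong = [10.0, 0.0]
print(softmax(strong)[0], softmax_plus_one(strong)[0])
```

With uniformly low scores, the standard version still hands out weights that total one, while the modified version lets the total shrink toward zero.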

I love that they found a way to capitalize on this observation without having to retrain models. However, I wonder what the models would look like if they followed Evan's suggestion!

[1] Their description of attention sinks:

```

To understand the failure of window attention, we find an interesting phenomenon of autoregressive LLMs: a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task, as visualized in Figure 2. We term these tokens “attention sinks”. Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.

```

[2] https://news.ycombinator.com/item?id=36851494

2 comments

Actually, it seems like they did try the suggestion out, basically by training a model with a dedicated sink token of all zeros.

The verdict seems to be that, with the all-zeros sink, you still end up with other initial tokens being used as sinks, so it is better to have a dedicated learnable sink token.
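For anyone wondering why an all-zeros sink token counts as trying the suggestion: a key of all zeros gives a logit of 0, so it adds exp(0) = 1 to the softmax denominator, and a value of all zeros adds nothing to the output. A rough sketch with made-up vectors (the `attend` helper and its `extra_denominator` parameter are hypothetical, not from the paper's code):

```python
import math

def attend(query, keys, values, extra_denominator=0.0):
    """Single-query dot-product attention over plain lists.
    No scaling or max-subtraction, so only use with small toy logits."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(l) for l in logits]
    total = sum(exps) + extra_denominator
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(exps, values)) / total
            for d in range(dim)]

# Toy inputs, chosen arbitrarily for illustration.
query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# Variant A: softmax with one added to the denominator.
out_plus_one = attend(query, keys, values, extra_denominator=1.0)

# Variant B: plain softmax, but with an all-zeros key/value pair prepended.
zero = [0.0, 0.0]
out_zero_sink = attend(query, [zero] + keys, [zero] + values)

print(out_plus_one)
print(out_zero_sink)  # matches Variant A up to floating-point rounding
```

This only shows the two formulations compute the same attention output; it says nothing about how a model trained either way behaves.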

That was the first time I'd read about it on HN, but as pointed out on that HN post, it wasn't the first time Softmax + 1 was proposed. And, AFAIK, it has never resulted in better performance in practice. Maybe Softmax + 1 works better for fiddling with the attention window after training, but I don't know if anyone has tested that at scale.