by spidersouris
990 days ago
The paper published by Xiao et al. (2023)[0] states that "a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task" (p. 2). Does that mean that task prefixes used for LLM generation (e.g. "translate: [sentence]") are actually attention sinks? I don't really understand what they mean by "irrespective of their relevance to the language modeling task."

[0] https://arxiv.org/pdf/2309.17453.pdf
|
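The key point is that these initial tokens just serve as a place to "offload" attention scores; their semantic meaning is irrelevant. Softmax forces each head's attention weights to sum to one, so a head with nothing relevant to attend to still has to put that weight somewhere, and the initial tokens are visible to every later position. So a task prefix like "translate:" can end up acting as a sink simply because it comes first, not because of what it says.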
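To make the "offloading" idea concrete, here is a toy sketch in Python. It is not the paper's experiment: the logits are made-up numbers, and the assumption that position 0 carries a fixed high logit is only a stand-in for whatever a trained model actually learns. It shows that, since softmax weights must sum to one, the first token soaks up most of the weight when no other key matches the query, and gives it back when one does.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for one query over 5 key positions.
# Position 0 is the initial token; its logit of 4.0 is an assumed value
# standing in for a learned "sink" preference, not a measured one.
logits_idle = [4.0, 0.0, 0.0, 0.0, 0.0]   # no later key is relevant to the query
logits_match = [4.0, 0.0, 7.0, 0.0, 0.0]  # the key at position 2 matches strongly

w_idle = softmax(logits_idle)
w_match = softmax(logits_match)

print("idle :", [round(w, 3) for w in w_idle])
print("match:", [round(w, 3) for w in w_match])
```

In the idle case position 0 receives over 90% of the weight even though its content plays no role in the computation; in the match case most of the weight moves to position 2. Nothing here depends on what the first token means, only on where it sits.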