by ethan_smith
322 days ago
Attention weights can still assign non-zero probability to irrelevant tokens, since the mechanism is optimized for prediction rather than semantic relevance. The values of those irrelevant tokens then get mixed into the hidden state, where they can interfere with the representation.
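
To make that concrete, here is a rough sketch of a single softmax attention step in plain Python. The scores and value vectors are made up for illustration (token 0 is treated as the "relevant" one), and the helper names `softmax` and `attend` are my own, not taken from any library.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; exp() is always positive,
    # so every token ends up with a strictly positive weight.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    # Weighted sum of value vectors: the output mixes in every token.
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical example: token 0 is relevant, tokens 1-3 are not.
scores = [4.0, 0.0, 0.0, 0.0]
# Relevant token carries signal in dimension 0, irrelevant ones in dimension 1.
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

weights, output = attend(scores, values)
print("weights:", weights)
print("output:", output)
```

Even with a large score gap, the irrelevant tokens keep a small share of the weight, and that share shows up directly in dimension 1 of the output. With more irrelevant tokens in the context, their combined share grows.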