by Translationaut 853 days ago
This seems to work only because large GPTs have redundant, undercomplex attention. See this BertViz issue about attention in Llama: https://github.com/jessevig/bertviz/issues/128