Hacker News new | ask | show | jobs
by menaerus 504 days ago
I didn't get the impression that the math around it is that simplistic. The first obvious reason I can think of now is the attention mechanism being used. Both GQA and MQA demand less compute and therefore less bandwidth than MHA.