by albertzeyer
485 days ago
I also just read that paper. But I wonder: even though MLA is strictly more powerful, do you really gain anything from that in experiments? The paper doesn't include many experimental comparisons. GQA, on the other hand, should still be faster (no need for an extra linear transformation).