Hacker News
by riku_iki 806 days ago
I didn't get one detail: they selected a 6B transformer as the baseline and compared it to a 7B Griffin.

Why wouldn't they select equal-size models?

1 comments

They probably already had them for some reason, and it was cheaper not to retrain one of them.
It's just that the performance comparison is misleading then: they report marginal improvements, which are expected simply because of the difference in model size.
It also performs better at all the other sizes.
The baseline transformer in their tables has a maximum size of 6B. The other models are trained on very different data and were probably trained differently.
All the MQA transformers, Hawk, and Griffin are trained on the same MassiveText dataset, so no.
Yes, but the MQA transformer is limited to 6B, while the "other" larger non-RNN models in the table (Llama-2) are not trained on the same dataset, and Hawk and Griffin are 7B. Sorry, I don't understand your point.