Hacker News
by
riku_iki
806 days ago
I didn't get one detail: they selected a 6B transformer as the baseline and compared it to the 7B Griffin.
Why wouldn't they select equal-size models?
szundi
806 days ago
They probably already had them for some reason, and it was cheaper not to retrain one of them.
riku_iki
806 days ago
It's just that the performance comparison is misleading then: they report marginal improvements, which are expected simply because of the difference in model size.
GaggiX
806 days ago
It also performs better at every other size.
riku_iki
806 days ago
The largest baseline transformer in the tables is 6B. The other models are trained on very different data, and probably with different training setups.
GaggiX
806 days ago
All the MQA transformers, Hawk, and Griffin are trained on the same MassiveText dataset, so no.
riku_iki
806 days ago
Yes, but the MQA transformer is limited to 6B, while the "other" larger non-RNN models in the table (Llama-2) are not trained on the same dataset, and Hawk and Griffin are 7B. Sorry, I don't understand your point.