by RoddaWallPro
68 days ago
5 years ago was the beginning of 2021, just under a year after GPT-3 was released (and GPT-3 was not good at doing anything useful). That model had 175B params. GPT-4 has been widely rumored to have 1.8 trillion params, which is roughly 10x more, and it was released 2 years after the "5 years ago" date you are using here. So, to quote yourself, "This is not true and unfortunately this significantly reduced the credibility of this article for me" /s/article/comment
Meanwhile, Gemma 2 9B, a model from July 2024 with 133x fewer parameters than GLaM, scores 82% and 80.6%. HellaSwag and WinoGrande aren't used in modern benchmarks, probably because they're too easy and largely memorised at this point.
And sure, GPT-4 had 1.8T parameters, but it's noticeably worse than any modern model a fraction of its size, and the original incarnation was ridiculously expensive per token. In any case, that parameter count was only possible due to using mixture-of-experts, which I would definitely classify as a sophisticated architecture as opposed to just throwing more parameters at a vanilla transformer. Even in 2021, GLaM was a MoE because the limits of scaling dense transformers had already been hit.