How do you determine the multiplier. Because e.g. there are many problems that GPT4 can solve while GPT3.5 can't. In this case it is infinitely better.
Let's say your benchmark gets you at 60% with a 70b parameter model and you get to 65% with a 405b one, it's fairly obvious that it's just incremental progress, not a sustainable growth of capabilities per added parameter. Also, most of the data used these days for trainings these very large models is synthetic data, which is probably very low quality overall compared to human-sourced data.
But so if there's a benchmark that a model scores at 60%, does it mean that it's literally impossible to make anything that could be more than 67% better?
E.g. if someone scores 60% at a high school exam, is it impossible for anyone to be more than 67% smarter than this person at that subject?
Then what if you have another benchmark where GPT3.5 scores 0%, but GPT4 scores 2%. Does it make GPT4 infinitely better?
E.g. supposedly there was one LLM that did 2% in FrontierMath.