by nylonstrung
255 days ago
Read a study called "The Leaderboard Illusion", which credibly alleged that Meta, Google, OpenAI, and Amazon got unfair treatment from LM Arena that distorted the benchmarks. LM Arena gave them special access to test privately and let them benchmark over and over without showing the failed tests. Meta got to privately test Llama 4 27 times to optimize it for high benchmark scores, and was then allowed to report only the highest, cherry-picked result. Which makes sense, because in real-world applications Llama is recognized to be markedly inferior to models that scored lower.
|
Not that it makes LMArena a perfect benchmark. By now, everyone who wanted to push LMArena ratings at any cost knows what the human evaluators there are weak to, and what they should aim for.
But your claim of "we know that ChatGPT, Google, Grok and Claude have explicitly gamed <benchmarks> to inflate their capabilities" still has no leg to stand on.