| There are quite a few benchmarks for which that's not the case: (1) live benchmarks (livebench, livecodebench, matharena, SWE-rebench, etc.); (2) benchmarks that do not have a fixed structure, like games or human-feedback benches (balrog, videogamebench, arena); (3) to some extent, benchmarks without existing/published answers (putnambench, frontiermath). You could argue that someone could hire people to solve those or pay off the benchmark devs, but that's much more complicated. It's true that most benchmarks that don't try to tackle future contamination are much less useful. Unfortunately, HLE kind of ignored the problem (they plan to add a hidden set to test for contamination, but once the answers are out there, it's a lost game IMHO), which is a shame because I really liked the concept. Edit: it is true that these benchmarks focus on only a fairly specific subset of model capabilities. For everything else, a vibe check is your best bet. |
Of course, some benchmarks are still valid and will remain valid. E.g., we can make the models play chess against each other and score them on how well they do. But those benchmarks are generally fairly narrow. They don't really measure the "broader" intelligence we are after. And LLMs often perform worse than specialized models. E.g., I don't think there is any LLM out there that can beat a traditional chess engine (certainly not with the same computing power).
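To make the "play against each other and score them" idea concrete, here is a minimal Elo-style rating sketch. The K-factor of 32 and the function names are my own assumptions for illustration, not something any particular benchmark is known to use.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, chosen only for illustration.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total stays constant.
    return rating_a + delta, rating_b - delta


# Two models start level; model A wins one game.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

Ratings like this only say who wins at chess, which is exactly the narrowness problem: the number is trustworthy, but it covers very little.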
What is really bad are the QA benchmarks, which leak into the models' training data over time. And one can suspect that even big labs have an economic incentive to score well on popular benchmarks, which pushes them to tune the models for those benchmarks way beyond what is reasonable.
And taking a bunch of flawed benchmarks, combining them into indexes, and saying this model is 2% better than that model is just completely meaningless, but of course it's fun and draws a lot of attention.
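Even setting contamination aside, a 2% gap is often inside plain sampling noise. A rough sketch, assuming a benchmark of 500 questions and treating the two models' scores as independent proportions (a paired test on the same questions would be tighter; the numbers here are made up for illustration):

```python
import math


def accuracy_diff_ci(acc_a: float, acc_b: float, n: int, z: float = 1.96):
    """Approximate 95% confidence interval for acc_a - acc_b.

    Uses the normal approximation for two independent proportions,
    each measured on n questions. This is a simplifying assumption.
    """
    diff = acc_a - acc_b
    std_err = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return diff - z * std_err, diff + z * std_err


# Hypothetical scores: 72% vs 70% on a 500-question benchmark.
low, high = accuracy_diff_ci(0.72, 0.70, n=500)
# The interval spans zero, so the "2% better" claim is not supported.
```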
So, yes, we are kind of left with vibe checks, but in theory we could do more: take a bunch of models and have a big enough, representative group of human evaluators score them against each other, double-blind, on meaningful subjects.
Of course, done right, that would be really expensive. And those sponsoring it might not like the result.
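For what it's worth, the bookkeeping side of the blind head-to-head setup described above is cheap; the expensive part is the humans. A minimal sketch, assuming evaluators only ever see the anonymous labels "A" and "B" and vote "A", "B", or "tie" (all names here are hypothetical, and this covers only the label hiding, not recruiting a representative panel):

```python
import random
from collections import Counter


def blind_pair(model_1: str, model_2: str, rng: random.Random) -> dict:
    """Randomly assign two models to the anonymous labels 'A' and 'B'."""
    pair = [model_1, model_2]
    rng.shuffle(pair)
    return {"A": pair[0], "B": pair[1]}


def win_rates(judgments) -> dict:
    """Aggregate votes into per-model win rates (a tie counts as half a win).

    judgments: iterable of (mapping, vote), where mapping comes from
    blind_pair and vote is 'A', 'B', or 'tie'.
    """
    wins, games = Counter(), Counter()
    for mapping, vote in judgments:
        for model in mapping.values():
            games[model] += 1
        if vote == "tie":
            for model in mapping.values():
                wins[model] += 0.5
        else:
            wins[mapping[vote]] += 1
    return {model: wins[model] / games[model] for model in games}


# Hypothetical usage: the mapping is stored server-side, the evaluator sees only A/B.
rng = random.Random(0)
mapping = blind_pair("model_x", "model_y", rng)
rates = win_rates([(mapping, "A"), (mapping, "tie")])
```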