by chvid
390 days ago
Benchmarks seem like a fool's errand at this point: models get over-tuned to specific, already-published tests rather than made to generalize. Hugging Face has a leaderboard, and it seems dominated by fine-tunes of various common open source models that don't seem to be widely used: https://huggingface.co/open-llm-leaderboard
|
- live benchmarks (livebench, livecodebench, matharena, SWE-rebench, etc.)
- benchmarks that do not have a fixed structure, like games or human feedback benches (balrog, videogamebench, arena)
- (to some extent) benchmarks without existing/published answers (putnambench, frontiermath). You could argue that someone could hire people to solve those or pay off a benchmark dev, but that's much more complicated.
Most of the benchmarks that don't try to tackle future contamination are much less useful, that's true. Unfortunately, HLE kind of ignored it (they plan to add a hidden set to test for contamination, but once the answers are there, it's a lost game IMHO); I really liked the concept.
Edit: it is true that these benchmarks focus only on a fairly specific subset of model capabilities. For everything else, a vibe check is your best bet.
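
For what it's worth, the live-benchmark idea from the list above mostly boils down to scoring a model only on problems published after its training cutoff. A rough sketch of that filter, where `Problem`, `uncontaminated`, and all the dates are made up for illustration and not taken from any of those benchmarks' actual code:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record: a task name plus the date it first became public.
    name: str
    published: date


def uncontaminated(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


# Illustrative data only.
problems = [
    Problem("old-task", date(2023, 5, 1)),
    Problem("new-task", date(2024, 9, 15)),
]

fresh = uncontaminated(problems, training_cutoff=date(2024, 1, 1))
print([p.name for p in fresh])
```

This only helps if the cutoff date is known and honest, which is part of why the human-feedback and game-style benches are still useful alongside it.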