| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Squarex 54 days ago
	I would say all benchmarks are inherently subjective. How is yours better? It seems to produce a little bit strange results. Opus 4.6 being worse than 4.5 for example. Or chinese models being rated too high. Kimi, Deepseek or GLM are all great in open source world, but I don't believe they are ahead of SOTA models from Anthropic, OpenAI or Google.

3 comments

gertlabs 54 days ago

No, some benchmarks are definitely objective, but most can be easily gamed. For example, most of the benchmarks on the model cards: they have measurable answers that don't rely on a human judge (a human made the question, but the answers are measuring some uncontroversial knowledge or capability). But because there is a single, correct answer, and those answer leak (or are randomly discovered and optimized for in training), they lose value over time, and regardless, they have a ceiling on the intelligence they can measure.

Others are purely subjective, like LMArena, which really only measures the personality and style preferences of the masses at this point, because frontier LLM technical answers are too hard for the average person to judge.

Then there are some interesting one-off benchmarks, but they lack enough rigor, breadth, and samples to draw larger conclusions from.

So we designed our benchmark with 3 goals: objective measurements (individual submissions not dependent on a human or LLM judge), no known correct answer (so simulations can scale to much higher levels of intelligence), and enough variety over important aspects of intelligence. We do this by running multiple models in cooperative/competitive environments with very complex action spaces and objective scoring, where model performance is relative and affected by the actions of other participants.

And yeah, there are some interesting results when you have a more objective benchmark. It should raise eyebrows when every single sub-release of every company's model is better across the board than its predecessor -- that isn't reality.

link

Squarex 54 days ago

The word "objective" just seems too authoritative to me.

link

tw1984 54 days ago

I agree that benchmarks are inherently subjective.

but the fact that you cite your brief as your main argument is funny - you don't even have any inherently subjective numbers to justify what you believe, you only have "I don't believe".

link

Squarex 53 days ago

Sure, I have mixed up two things together. I don't think this benchmark is bad, I just did not like it is presented as the ultimate objective truth. The other thing I have mentioned is that it delivers different results from other benchmarks, so the "believe" stems from other benchmarks.

link

segmondy 54 days ago

you are arguing with your belief instead of an objective truth. benchmark is more objective, if you don't agree with it, come up with a better one. but what you believe doesn't matter.

link

Squarex 54 days ago

It was not a confrontational take. But all benchmarks are designed by humans, we are not that great at measuring intelligence. So it is somewhat subjective. I was just arguing with the word "objective". Not with the results per se.

link

swiftcoder 54 days ago

If the benchmark has a correct answer, the benchmark itself is an objective measure (but of what?). The "of what" may well be subjective

link

orbital-decay 52 days ago

Only if the benchmark is private and done properly on relevant tasks, which is rarely the case. I can guarantee that you have a ton of blind spots if you look at it through the lens of a ranking ladder in some generic tasks.

link