

	by gertlabs 60 days ago
	A better benchmark needs to be objectively scored, have multi-disciplinary breadth, and be scalable (no single correct answer). That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.

3 comments

orangebread 60 days ago

Wow. This benchmark definitely feels more accurate than the other rankings I've seen. My experience with gpt 5.4/5.5 is that they are technically flawless, and if there are any technical issues, that is because the input didn't provide enough clarity. That's not to say that it doesn't autonomously react to any issues during bug fixes or implementations, but it'll tend to nail its tasks without leaving behind gaps.

Opus otoh is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on gpt 5.5 to check its work.

The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will be after looking at this.

Grats to your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!


euleriancon 60 days ago

Are we looking at the same data? On that site I see that opus 4.7's and gpt 5.5's g scores are within each other's confidence intervals, and both significantly ahead of the number 3 model.

Your comment makes it sound like they are miles apart, which the benchmark doesn't seem to support.

Edit: I looked at the data more and the two models are only basically equal when looking at the mean of all the tests. Gpt 5.5 significantly outperforms opus 4.7 in coding, while opus 4.7 significantly outperforms in "decision making." I'm not seeing details on what decision making explicitly means.
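
For anyone who wants to run the same check on their own numbers, here is a minimal Python sketch of an interval-overlap test. It assumes a normal-approximation 95% interval on the mean. The function names and sample scores are made up for illustration, and the site may compute its intervals differently.

```python
import math
from statistics import mean, stdev

def confidence_interval(samples, z=1.96):
    """Normal-approximation interval for the mean (z=1.96 is roughly 95%)."""
    center = mean(samples)
    half_width = z * stdev(samples) / math.sqrt(len(samples))
    return (center - half_width, center + half_width)

def intervals_overlap(a, b):
    """True when the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical per-task scores, not data from the benchmark.
model_a = [71.0, 74.5, 69.0, 73.0, 72.5]
model_b = [72.0, 70.5, 75.0, 71.5, 73.5]

print(intervals_overlap(confidence_interval(model_a),
                        confidence_interval(model_b)))
```

Overlap is only a rough check, not a formal significance test, and running it per category rather than on the overall mean is what surfaces a coding vs. decision-making split like the one described above.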


gertlabs 60 days ago

Decision making refers to the environments where the LLM is called on every tick (like games with social communication). Examples here: https://gertlabs.com/spectate.

Because GPT 5.5 just launched and those games take longer to accumulate data for, it just doesn't have enough samples yet. It will end up with a wider lead on Opus, I am sure. Coding evals always have large sample sizes on day 1. Good find, we should probably better adjust the weighting here for decision games with low match counts.
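
One common way to adjust for low match counts is to shrink a category score toward a prior until enough samples arrive. The Python sketch below illustrates that idea only. It is not the weighting gertlabs uses, and `shrunk_score`, `prior_mean`, and the constant `k` are assumptions made up for the example.

```python
def shrunk_score(category_mean, n_matches, prior_mean, k=50):
    """Pull a category mean toward prior_mean when n_matches is small.

    k is a hypothetical tuning constant: with n_matches == k the observed
    mean and the prior get equal weight.
    """
    weight = n_matches / (n_matches + k)
    return weight * category_mean + (1 - weight) * prior_mean

# Hypothetical numbers: a new model with few decision-game matches.
print(shrunk_score(category_mean=80.0, n_matches=5, prior_mean=60.0))
# The same observed mean once many matches have accumulated.
print(shrunk_score(category_mean=80.0, n_matches=5000, prior_mean=60.0))
```

With few matches the score stays close to the prior. As matches accumulate it converges to the observed mean, so a newly launched model is neither rewarded nor punished for a handful of early games.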


orangebread 60 days ago

Right, I'm including my own observations in what the leaderboard is showing. Could be confirmation bias, but I use both Opus and GPT extensively and since GPT 5.4 I have noticed that Opus doesn't even begin to touch GPT's level of technical depth. I was hoping Opus 4.7 would close that gap, but unfortunately it doesn't even compare to GPT 5.4 in that sense.

I'm not being a hater, I love Opus for different reasons, but I can't rely on it for its technical ability.


gertlabs 60 days ago

Much appreciated! MiMo V2.5 Pro is by far the most underrated recent release (probably because it wasn't open weights from the start).


yalok 60 days ago

Amazing to see Claude Code top models still way above all other models for C++ & Java, while GPT 5.5 is higher in Python & JS and others. Shows the skew in the training data sets, and maybe the go-to-market focus - with Anthropic focusing on enterprise customers much more than OpenAI?

Matches with my experience with Opus for C++.

C# results are empty - @gertlabs - any ETA for those?


gertlabs 59 days ago

C# testing is a new feature added a few days ago from HN comment suggestions; samples will continue growing. Most C# data is currently for non-agentic workloads: https://gertlabs.com/?mode=oneshot_coding


monlockandkey 60 days ago

Your benchmark suggests Deepseek V4 pro performs worse than Deepseek V4 flash? That is an interesting result. Any comments on that outcome?


gertlabs 60 days ago

It's a surprising result, and a lot of it stems from the Pro variant struggling with our custom harness in agentic tasks (whereas Flash does fine), as well as provider instability. Failed requests are not counted against the model in its score, but it's possible there are additional silent degradations even on successful requests.

Either that, or Flash is truly a better architecture and the Pro variant is heavily benchmaxxed. It wouldn't be the first time we saw something like that in our benchmarking. We collect samples every week so it'll be interesting to see if it rebalances over time as new providers host the model. Flash is great though; it's so fast and cheap.
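
To make the "failed requests are not counted" point concrete, here is a minimal Python sketch of scoring that skips failed requests. The `Result` type and `mean_score` function are hypothetical, not the actual harness code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Result:
    # None marks a request that failed at the provider (hypothetical convention).
    score: Optional[float]

def mean_score(results: List[Result]) -> Optional[float]:
    """Average only the requests that returned a score; failures are skipped."""
    scored = [r.score for r in results if r.score is not None]
    if not scored:
        return None  # no successful requests, so there is nothing to score
    return sum(scored) / len(scored)

# Hypothetical run: one provider failure among three requests.
results = [Result(1.0), Result(None), Result(0.5)]
print(mean_score(results))
```

A filter like this only catches requests that fail outright. A degraded but successful response still returns a score and drags the average down, which is the silent-degradation case mentioned above.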
