HN Mirror

	by mynti 252 days ago
	It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.

13 comments

Workaccount2 252 days ago

I think Anthropic is reading the room and is just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full-blown multimodality at the highest level.

It's probably pretty liberating, because you can make a "spiky" intelligence with only one spike to really focus on.

aerhardt 252 days ago

Codex has been good enough to me and it’s much cheaper.

I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it is good enough for me not to even consider the competition.

Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.

enraged_camel 252 days ago

>> Codex has been good enough to me and it’s much cheaper.

It may be cheaper but it's much, much slower, which is a total flow killer in my experience.

dudeinhawaii 251 days ago

Not to start a war but I've had 'fast' Claude write reams of slop code that I then have had to work with Codex to remove. Add this to the pile of "yeah but I saw the opposite with <insert model>" - but that's been my 2 cents.

Putting the latest Gemini CLI through some tough code tasks (C++) for my project, I found it to be slower than even Codex but good quality.

The problem I have is skepticism. Gemini 2.5 Pro was amazing on release, I couldn't stop talking about it. And then it went to being worthless in my workflows after a few months. I suspect Google (and other vendors) do this bait and switch with every release.

Let me see the benchmarks in 3 months.

enraged_camel 251 days ago

Claude can definitely write a lot of not-great code, but IME that's easy enough to mitigate by having it write a planning document first, then implement it step by step based on a to-do list on that planning document. Cursor's plan mode works great for this. It lets you review the outline at the start, then review each bit as the model writes it.

That said, I haven't had a good experience with Claude Code for the reason you described. Maybe it's Cursor (or similar IDE) that makes the difference.

mock-possum 251 days ago

My issue with Codex is needing to run it in WSL on Windows, due to it spamming confirmation requests for running even the safest of commands (e.g. list directory contents, read file, git status). That in turn adds an extra layer of complexity when hooking it up via MCP to anything running in Windows outside of WSL (like, say, Figma).

In Claude, on the other hand, MCP connections really do seem to ‘just work’.

htrp 252 days ago

More playing to their strengths. A giant chunk of their usage data is basically code gen.

Miraste 252 days ago

It remains to be seen whether that works out for them, but it seems like a good bet to me. Coding is the most monetizable use anyone has found for LLMs so far, and the most likely to persist past this initial hype bubble (if the Singularity doesn't work out :p).

vharish 252 days ago

From my personal experience using the CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than Gemini. Claude definitely has its own advantages but is expensive (at least for some, if not for all).

My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.

I do, however, use Gemini CLI for the most part, just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.

xnx 252 days ago

Gemini CLI is moving really fast. Noticeable improvements in features and functionality every week.

cmrdporcupine 251 days ago

Yeah, you can see this even by just running claude-code against other models. For example, DeepSeek used as a backend for CC tends to produce results mostly competitive with Sonnet 4.5. A lot is just in the tooling and prompting.

felipeerias 252 days ago

IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.

_factor 252 days ago

This seems preferable. Wasting tokens on tools when a standardized, reliable interface to those tools should be all that's required.

The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.

Palmik 252 days ago

Also does not beat GPT-5.1 Codex on terminal bench (57.8% vs 54.2%): https://www.tbench.ai/

I did not bother verifying the other claims.

HereBePandas 252 days ago

Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas the Gemini 3 Pro seems to be on a standard eval harness.

It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.

Palmik 252 days ago

All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results are using Gemini CLI.

What do you mean by "standard eval harness"?

lucassz 252 days ago

I think the point is that it looks like Gemini 3 was only tested with the generic "Terminus 2", whereas Codex was tested with the Codex CLI.

enraged_camel 252 days ago

Do you mean that Gemini 3 Pro is "vanilla" like GPT 5.1 (non-Codex)?

HereBePandas 252 days ago

Yes, two things:

1. GPT-5.1 Codex is a fine-tune, not the "vanilla" 5.1.

2. More importantly, GPT 5.1 Codex achieves its performance when used with a specific tool (Codex CLI) that is optimized for GPT 5.1 Codex. But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.

Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.

embedding-shape 252 days ago

> But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.

That'd be a bad idea. Models are often trained for specific tools (like GPT Codex is trained for Codex, and Sonnet has been trained with Claude Code in mind), and vice versa: the tools are built with a specific model in mind, as they all work differently.

Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will start to be gamed instead.

tosh 252 days ago

This might also hint at SWE struggling to capture what “being good at coding” means.

Evals are hard.

raducu 252 days ago

> This might also hint at SWE struggling to capture what “being good at coding” means.

My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.

Squarex 251 days ago

It is just Python and Django. It might indicate qualities in other technologies, but it is not a very good benchmark.

JacobAsmuth 252 days ago

50% of the CLs in SWE-Bench Verified are from the Django codebase. So if you're a big contributor to Django, you should care a lot about that benchmark. Otherwise the difference between models is ±2 tasks done correctly. I wouldn't worry too much about it. Just try it out yourself and see if it's any better.
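
A rough sketch of the arithmetic behind the remark above that the models differ by only a couple of tasks. It assumes SWE-Bench Verified has 500 tasks, uses two made-up scores (not figures from this thread), and treats each task as an independent pass/fail trial, which is a simplification. The function names are hypothetical.

```python
import math

def gap_in_tasks(score_a_pct, score_b_pct, n_tasks):
    """Convert a percentage-point gap between two scores into a number of tasks."""
    return (score_a_pct - score_b_pct) / 100 * n_tasks

def standard_error_pct(score_pct, n_tasks):
    """Binomial standard error of a pass rate, in percentage points.

    Assumes tasks are independent pass/fail trials (a simplification).
    """
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n_tasks)

N_TASKS = 500                  # assumed benchmark size
SCORE_A, SCORE_B = 76.4, 76.0  # hypothetical scores, for illustration only

gap = gap_in_tasks(SCORE_A, SCORE_B, N_TASKS)
noise = standard_error_pct(SCORE_B, N_TASKS)

print(f"gap: {gap:.1f} tasks")
print(f"standard error: {noise:.2f} points (~{noise / 100 * N_TASKS:.1f} tasks)")
```

Under these assumptions a 0.4-point gap is two tasks, well inside the sampling noise of a single run, so a difference that small says little about which model is better.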

aoeusnth1 252 days ago

Their scores on SWE bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on TerminalBench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI instead of Python).

varispeed 252 days ago

Never got good code out of Sonnet. It's been Gemini 2.5 for me followed by GPT-5.x.

Gemini is very good at pointing out flaws that are very subtle and not noticeable at a first or second glance.

It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.

baq 252 days ago

I find Gemini 2.5 Pro to be as good as, or in some cases better than, GPT 5.1 for SQL. It's aging otherwise, but they must have some good SQL datasets in there for training.

alyxya 252 days ago

I think Google probably cares more about a strong generalist model rather than solely optimizing for coding.

macrolime 252 days ago

Pretty sure it will beat Sonnet by a wide margin in actual real-world usage.

I_am_tiberius 251 days ago

I don't know if this is true but I believe Anthropic has for a long time illegally used user prompts for training, without user consent.

HereBePandas 252 days ago

[comment removed]

Palmik 252 days ago

The reported results where GPT 5.1 beats Gemini 3 are on SWE Bench Verified, and GPT 5.1 Codex also beats Gemini 3 on Terminal Bench.

HereBePandas 252 days ago

You're right on SWE Bench Verified, I missed that and I'll delete my comment.

GPT 5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (hard to tell how much of that is the Codex-specific harness vs. the model). I look forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins, given how close it comes in these benchmarks.

Palmik 252 days ago

All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results are using Gemini CLI.

jbellis 252 days ago

swebench is (1) terrible and (2) saturated
