HN Mirror



	by mlmonkey 2 days ago
	It's funny how in their own graph, https://storage.googleapis.com/gweb-uniblog-publish-prod/ima... Gemini 3.5 Flash is beat hands down by both Opus 4.8 and GPT 5.5, and yet the graph is drawn as if Gemini wins ... :-D

6 comments

mroche 2 days ago

The graph has Gemini 3.5 Flash matching Sonnet 4.6, losing to Opus 4.8, and slightly behind GPT-5.5 by 0.3 points... That's not that much of a hands-down loss for Gemini for this specific workload benchmark.

The methodology used:

https://deepmind.google/models/evals-methodology/gemini-3-5-...

Methodology: All Gemini scores are pass @1 except where otherwise noted. "Single attempt" settings allow no majority voting or parallel test-time compute. All of the results are all run with the Gemini API for the model-id gemini-3.5-flash with default sampling settings unless indicated otherwise below. To reduce variance, we average over multiple trials for smaller benchmarks.

All the results for non-Gemini models are sourced from providers' self reported numbers unless otherwise mentioned below. For Claude Opus 4.7, Sonnet 4.6, and GPT-5.5 we default to reporting maximum thinking/reasoning settings available, but when reported results are not available we use best available reasoning results.


sheept 2 days ago

It highlights the Gemini models blue since that's what the article is about. The bar heights seem consistent with the values.


namuol 2 days ago

They should be sorting the models by performance on the horizontal axis.


roygbiv2 1 day ago

Why would they make their own product look worse?


namuol 1 day ago

Didn’t say it would, it’s just a better way to illustrate how each model fares in comparison.


data-ottawa 2 days ago

I think 3.5 flash is trying to target agentic work, like Google Search or ADK (agent development kit) use cases.

It’s something cheap enough you’d put out in front of your customers, and Opus is expensive enough you wouldn’t.


gb2d_hn 2 days ago

It's honest - people who know what they are looking at will take speed and token costs into account. I don't use Gemini 3.5 for coding, but I use it as something in between a search engine and agent.


IncreasePosts 2 days ago

It's amazing how designers of charts trying to show their product is close to the leader always remember to start the axis at zero, and designers of charts trying to show how big their lead is always forget to.


timacles 1 day ago

Promotional material isn’t a medium for scientific rigor.


antonvs 1 day ago

> beat hands down

The difference from GPT 5.5’s score is 0.3 points, hardly “hands down”.
