by h4ny 307 days ago

I have been seeing different people reporting different results with different tasks. Watched a live stream that compared GPT-5, Gemini Pro 2.5, Claude 4 Sonnet, and GLM 4.5, and GPT-5 appeared to not follow instructions as well as the other three.

At the moment it feels like most people's "reviews" of models depend on their beliefs and agendas, and there are no objective ways to evaluate and compare models (many benchmarks can be gamed).

The blurring of boundaries between technical overview, news, opinion and marketing is truly concerning.

7 comments

epolanski 307 days ago

I will also state another semi-obvious thing that people seem to consistently forget: models are non-deterministic.

You are not going to get the same output from GPT5 or Sonnet every time.

And this obviously compounds across many different steps.
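A rough way to see the compounding, as a sketch under a simplifying assumption that is not from this thread: suppose each step of a multi-step run independently has some fixed chance `p_same` of reproducing what it produced last time. The function name and the numbers are hypothetical, and real model steps are unlikely to be independent.

```python
def chance_identical_run(p_same: float, steps: int) -> float:
    """Chance that every step of a run matches an earlier run.

    Assumption (hypothetical): steps are independent and each one
    reproduces its earlier output with probability p_same.
    """
    if not 0.0 <= p_same <= 1.0:
        raise ValueError("p_same must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent events multiply, so the chance shrinks geometrically.
    return p_same ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        print(steps, round(chance_identical_run(0.9, steps), 3))
```

Under that assumption, even a step that repeats itself 90% of the time gives a 10-step run that matches end to end only about 35% of the time.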

E.g. give GPT5 the code for a feature (by pointing it at some files and tests) and tell it to review it, find improvement opportunities and write them down: depending on the size of the code, etc., the answers will be slightly different each time.

I often do it in Cursor by having multiple agents review a PR, and each of them:
- has to write down its review in pr-number-review-model.md (e.g. pr-15-review-sonnet4.md)
- has to review the other agents' review files
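The naming scheme and the cross-review step could be sketched roughly like this. It is a minimal illustration, not the commenter's actual setup: the function names are hypothetical, and the `gpt5` label is an assumed example (only `sonnet4` appears in the comment).

```python
def review_filename(pr_number: int, model: str) -> str:
    """Build a review file name following pr-number-review-model.md."""
    return f"pr-{pr_number}-review-{model}.md"


def cross_review_plan(pr_number: int, models: list[str]) -> dict[str, list[str]]:
    """Map each agent's own review file to the other files it must review."""
    files = {model: review_filename(pr_number, model) for model in models}
    return {
        files[model]: [files[other] for other in models if other != model]
        for model in models
    }


if __name__ == "__main__":
    # "gpt5" is an assumed label for illustration only.
    plan = cross_review_plan(15, ["sonnet4", "gpt5"])
    for own_file, others in plan.items():
        print(own_file, "reviews", others)
```

Each agent writes one file and then reads every file except its own, so the number of cross-reviews grows with the square of the number of agents.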

Then I review it myself and try to decide what's valuable in there and what's not. And to my disappointment (towards myself):
- they often do point out valid flaws I wouldn't have thought about
- they miss the "end-to-end" or general view of how the code fits into a program/process/business

What do I mean: sometimes the real feedback would be that we don't need it at all. But you need to have these conversations with the AI earlier.

x187463 307 days ago

This has been ubiquitous for a while. Even here on HN every thread about these models (even this one, I'm sure) features an inordinate amount of disagreement between people vehemently declaring one model more useful than another. There truly seems to be no objective measurement of quality that can discern the difference between frontier models.

physix 307 days ago

I think this is actually good, because it means there is no clear winner who can sit back and demand rent. Instead they all work as hard as they can to stay competitive, hopefully thereby accelerating AI software engineering capabilities, with the investors footing the bill.

NitpickLawyer 307 days ago

Yeah, I agree. And prices are slowly coming down. Gemini 2.5 was cheaper than claude4, and (again depending on task) either on par or slightly below in quality. Now gpt5 is cheaper still (I think their -main is 10$/M?) and they also have -mini and -nano versions. The more choices we have the better it will be. As you said, without a clear winner we're about to get spoiled for choice, and there's no clear way for them to just sit on stuff and increase prices (yet). Plus there's some pressure coming from the open source releases. Not there in quality, but they are runnable "on prem", pretty cheap and keep getting better.

vineyardmike 307 days ago

> At the moment it feels like most people's "reviews" of models depend on their beliefs and agendas, and there are no objective ways to evaluate and compare models

I think you’ll always have some disagreement generally in life, but especially for things like this. Code has a level of subjectivity. Good variable names, the right amount of abstraction, verbosity, overcomplexity, etc. are at least partly matters of opinion. That makes benchmarking something subjective tough. Furthermore, LLMs aren’t deterministic, and sometimes you just get a bad seed in the RNG.

Not only that, but the harness and prompt used to guide the model make a difference. Claude responds to the word “ultrathink”, but if GPT-5 uses “think harder”, then what should be in the prompt?

Anecdotally, I’ve had the best luck with agentic coding when using Claude Code with Sonnet. Better than Sonnet with other tools, and better than Claude Code with other models. But I mostly use Go and Dart and I aggressively manage the context. I’ve found GPTs can’t write Zig at all while Gemini can, but they can both write Python excellently. All that said, if I didn’t like an answer, I’d prompt again, but if I liked the answer, I never tried again with a different model to see if I’d like it even more. So it’s hard to know what could’ve been.

I’ve used a ton of models and harnesses. Cursor is good too, and I’ve been impressed with more models in Cursor. I don’t get the hype around Qwen though, because I’ve found it makes lots of small(er) changes in a loop, and that’s noisy and expensive. Gemini is also very smart but worse at following my instructions, though I never took the time to experiment with prompting.

jjfoooo4 307 days ago

There's certainly a symbiosis between blog publishers and small startups wanting to be perceived as influential, and big companies releasing models and wanting favorable coverage.

I heavily discount same-day commentary: there's a quid pro quo of early access for favorable reviews (and yes, folks publishing early commentary aren't explicitly agreeing to write favorable things, but there's obvious bias baked in).

I don't think it's all particularly concerning: you can discount reviews that come out so quickly that it's unlikely the reviewer has really used the model very much.

muzani 307 days ago

If you were to objectively rank things, durian would be the best fruit in the world, Python would be the best programming language, and the Tesla Model Y would be the best car. Everyone has multiple inconsistent opinions on everything because everything is not the same.

Just pick something and use it. AI models are interchangeable. It's not as big a decision as buying a car or even a durian.

isaacremuant 307 days ago

> The blurring of boundaries between technical overview, news, opinion and marketing is truly concerning.

Can't help but laugh at this. It's like you just discovered skepticism and how the world actually works.

qsort 307 days ago

Thankfully that isn't a problem: we have scientific and reliable benchmarks to cut through the nonsense! Oh wait...
