| HN Mirror


by henriquegodoy 322 days ago

That SWE-bench chart with the mismatched bars (52.8% somehow appearing larger than 69.1%) was emblematic of the entire presentation: rushed and underwhelming. It's the kind of error that would get flagged in any internal review, yet here it is in a billion-dollar product launch. Combined with the Bernoulli effect demo confidently giving an incorrect explanation of how airplane wings work (the equal transit time fallacy that NASA explicitly debunks), it doesn't inspire confidence in either the model's capabilities or OpenAI's quality control.

The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.

The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.

7 comments

kkukshtel 322 days ago

You're sort of glossing over the part where this can now be leveraged as a cost-efficient agentic model that performs better than o3. Nobody used o3 for sw agent tasks due to costs and speed, and this now substantially seems to both improve on o3 AND be significantly cheaper than Claude.

synapsomorphy 322 days ago

o3's cost was sliced by 80% a month or so ago and is also cheaper than Claude (the output is even cheaper than GPT-5). It seems more cost efficient but not by much.

BoorishBears 322 days ago

This feels revisionist: no one used it because it wasn't as good.

extr 322 days ago

o3 is fantastic at coding tasks; until today it was the smartest model in existence. But it works only in few-shot conversational scenarios; it's not good in agentic harnesses.

m3kw9 322 days ago

You can use o3 for coding on the Plus plan, almost unlimited or until they throttle you.

withinboredom 322 days ago

not anymore

m3kw9 320 days ago

what do you mean? For CLI or web codex?

slashdave 322 days ago

GPT-5 had to be released, in any form. This announcement was not the product of a breakthrough, but the consequence of a business requirement.

zmmmmm 322 days ago

this is the real answer

it has to be released because it's not much better and OpenAI needs the team to stop working on it. They have serious competition now and can't afford to burn time / money on something that isn't shifting the dial.

IceDane 322 days ago

The whole presentation was full of completely broken bar charts. Not even just the typical "let's show 10% of the y axis so that a 5% increase looks like 5x" but stuff like the deception eval showing gpt5 vs o3 as 50 vs 47, but the 47 is 3x as big, and then right next to it we have 9 vs 87, more reasonably sized.

It's like no one looked at the charts, ever, and they just came straight from.. gpt2? I don't think even gpt3 would have fucked that up.

I don't know any of those people, but everyone who has been with OAI for longer than 2 years got 1.5m bonuses, and somehow they can't deliver a bar chart with sensible axes?
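The proportionality complaint above can be turned into a quick check. This is only a minimal sketch: the function name `bars_are_proportional`, the 5% tolerance, and the pixel heights are all hypothetical, and it assumes a zero-based axis (a truncated axis would need an offset).

```python
def bars_are_proportional(values, heights, tolerance=0.05):
    """Return True if every bar's height-to-value ratio matches the first bar's.

    Assumes the axis starts at zero, so height should scale linearly with value.
    """
    ratios = [height / value for value, height in zip(values, heights)]
    reference = ratios[0]
    return all(abs(r - reference) / reference <= tolerance for r in ratios)


# Hypothetical pixel heights: the bar for 47 drawn three times as tall as the bar for 50.
print(bars_are_proportional([50, 47], [100, 300]))  # False
# The same values drawn to scale.
print(bars_are_proportional([50, 47], [100, 94]))   # True
```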

throwaway_2898 322 days ago

TBH Claude Code max pro's performance on coding has been abhorrent (bad at best). The core of the issue is that the plan produced will more often than not use humans as verifiers (correctness, optimality and quality control). This is a fundamentally bad way to build systems that need to figure out if their plan will work correctly, because an AI system needs to test many plans quickly in a principled manner (it should be optimal and cost efficient).

So you might get that initial MVP out the door quickly, but when the complexity grows even just a little bit, you will be forced to stop, look at the plan, and try to get it to develop the plan further by saying things like: "use Design agent to ultrathink about the dependencies of the current code change on other APIs and use TDD agent to make sure tests are correct in accordance with the requirements I stated", and then one finds that even after all the thinking there are bugs that you will have to fix.

Source: I just tried max pro on two client Python projects and it was horrible after week 2.

z7 322 days ago

>The actual benchmark improvements are marginal at best

GPT-5 demonstrates exponential growth in task completion times:

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

hk__2 322 days ago

What do you mean? A single data point cannot be exponential. What the blog post says is that the ability of LLMs in general to solve tasks grows exponentially over time, and GPT-5 fits on that curve.

z7 322 days ago

Yes, but the jump in performance from o3 is well beyond marginal while also fitting an exponential trend, which undermines the parent's claim on two counts.

adammarples 322 days ago

Actually a single data point fits a huge range of exponential functions.
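To illustrate that claim with a minimal sketch: for any doubling time d, the curve y(t) = y0 * 2^((t - t0) / d) passes through the point (t0, y0). The helper name `exponential_through` and the numbers below are hypothetical, chosen only for illustration.

```python
def exponential_through(t0, y0, doubling_time):
    """Return an exponential curve that passes exactly through (t0, y0)."""
    def curve(t):
        return y0 * 2 ** ((t - t0) / doubling_time)
    return curve


# One hypothetical data point, three very different doubling times.
t0, y0 = 0.0, 1.0
curves = [exponential_through(t0, y0, d) for d in (1.0, 4.0, 12.0)]

# Every curve agrees at the data point...
print([c(t0) for c in curves])    # [1.0, 1.0, 1.0]
# ...but they disagree wildly away from it.
print([c(12.0) for c in curves])  # [4096.0, 8.0, 2.0]
```

A second data point is what pins down the doubling time, which is why comparing against earlier models matters here.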

usaar333 322 days ago

No it doesn't. If it were even linear compared to o1 -> o3, we'd be at 2.43 hours. Instead we're only at 2.29.

Exponential would be at 3.6 hours
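One possible reading of that comparison, as a rough sketch: given two earlier results at equally spaced releases, a linear continuation repeats the last absolute gain, while an exponential continuation repeats the last ratio. The function names and the input values below are hypothetical placeholders, not the figures behind the numbers quoted above, and equal spacing between releases is an assumption.

```python
def linear_next(previous, current):
    """Continue the trend by repeating the last absolute gain."""
    return current + (current - previous)


def exponential_next(previous, current):
    """Continue the trend by repeating the last multiplicative gain."""
    return current * (current / previous)


# Hypothetical task-length results in hours for two earlier models.
previous, current = 1.0, 2.0
print(linear_next(previous, current))       # 3.0
print(exponential_next(previous, current))  # 4.0
```

Whenever the earlier result grew (current > previous), the exponential continuation is the larger of the two, so a new result below the linear continuation also falls short of the exponential one.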

rrrrrrrrrrrryan 322 days ago

I suspect the vast majority of OpenAI's users are only using ChatGPT, and the vast majority of those ChatGPT users are only using the free tier.

For all of them, getting access to full-blown GPT-5 will probably be mind-blowing, even if it's severely rate-limited. OpenAI's previous/current generation of models haven't really been ergonomic enough (with the clunky model pickers) to be fully appreciated by less tech-savvy users, and their full capabilities have been behind a paywall.

I think that's why they're making this launch a big deal. It's just an incremental upgrade for the power users and the people that are paying money, but it'll be a step-change in capability for everyone else.

mlsu 322 days ago

They are selling "AGI"

replacing huge swathes of the white collar workforce

"incremental upgrade for power users" is not at all what this house of cards is built on

Sabinus 322 days ago

They are selling AGI to investors, but they're just selling intelligence to subscribers. And they just made the intelligence cheaper and better.

m3kw9 322 days ago

I've seen people's minds blown on the free tier prior to 5. It's basically 4o, which is pretty good for normies.

samsullivan 322 days ago

That's why they need to pay 300k for a slide designer https://openai.com/careers/creative-lead-presentation-design...