HN Mirror



	by strange_quark 137 days ago
	ChatGPT 3.5 was almost 40 months ago, not 24. GPT 4.5 was supposed to be 5 but was not noticeably better than 4o. GPT 5 was a flop. Remember the hype around Gemini 3? What happened to that? Go back and read the blog posts from November when Opus 4.5 came out; even the biggest boosters weren't hyping it up as much as they are now. It's pretty obvious the pace of change is slowing down, and there isn't a lot of evidence that shipping a better harness and post-training on using said harness is going to get us to the magical place where all SWE is automated that all these CEOs have promised.

2 comments

tibbar 137 days ago

Wait, you're completely skipping the emergence of reasoning models, though? 4.5 was slower and moderately better than 4o; o3 was dramatically stronger than 4o, and GPT 5 was basically a light iteration on that.

What's happening now is training models for long-running tasks that use tools, taking hours at a time. The latest models like 4.6 and 5.3 are starting to make good on this. If you're not using models that are wired into tools and allowed to iterate for a while, then you're not getting to see the current frontier of abilities.

(E.g., if you're just using models to do general knowledge Q&A, then sure, there's only so much better you can get at that and models tapered off there long ago. But the vision is to use agents to perform a substantial fraction of white-collar work, there are well-defined research programmes to get there, and there is steady progress.)


strange_quark 137 days ago

> Wait, you're completely skipping the emergence of reasoning models, though?

o1 was something like 16-18 months ago. o3 was kinda better, and GPT 5 was considered a flop because it was basically just o3 again.

I’ve used all the latest models in tools like Claude Code and Codex, and I guess I’m just not seeing the improvement? I’m not even working on anything particularly technically complex, but I still have to constantly babysit these things.

Where are the long-running tasks? Cursor’s browser that didn’t even compile? Claude’s C compiler that had gcc as an oracle and still performs worse than gcc without any optimizations? Yeah I’m completely unimpressed at this point given the promises these people have been making for years now. I’m not surprised that given enough constraints they can kinda sorta dump out some code that resembles something else in their training data.


onion2k 137 days ago

Fair enough, I guess I'm misremembering the timeline, but saying "It's taken 3 years, not 2!" doesn't really change the point I'm making very much. The road from what ChatGPT 3.5 could do to what Codex 5.3 can do represents an amazing pace of change.

I am not claiming it's perfect, or even particularly good at some tasks (pelicans on bicycles for example), but anyone claiming it isn't a mind-blowing achievement in a staggeringly short time is just kidding themselves. It is.
