| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by pscanf 90 days ago

I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them.

They're incredibly slow (via official API or openrouter), but most of all they seem not to understand the instructions that I give them. I'm sure I'm _holding them wrong_, in the sense that I'm not tailoring my prompt for them, but most other models don't have problem with the exact same prompt.

Does anybody else have a similar experience?

7 comments

ilaksh 89 days ago

These little 5.4 ones are relatively low latency and fast which is what I need for voice applications. But can't quite follow instructions well enough for my task.

That's really the story of my life. Trying to find a smart model with low latency.

Qwen 3.5 9b is almost smart enough and I assume I can run it on a 5090 with very low latency. Almost. So I am thinking I will fine tune it for my application a little.

jauntywundrkind 89 days ago

I've had such the opposite experience, but mainly doing agentic coding & little chat.

Codex is an ice man. Every other model will have a thinking output that is meaningful and significant, that is walking through its assumptions. Codex outputs only a very basic idea of what it's thinking about, doesn't verbalize the problem or it's constraints at all.

Codex also is by far the most sycophantic model. I am a capable coder, have my charms, but every single direction change I suggest, codex is all: "that's a great idea, and we should totally go that [very different] direction", try as I might to get it to act like more of a peer.

Opus I think does a better job of working with me to figure out what to build, and understanding the problem more. But I find it still has a propensity for making somewhat weird suggestions. I can watch it talk itself into some weird ideas. Which at least I can stop and alter! But I find its less reliable at kicking out good technical work.

Codex is plenty fast in ChatGPT+. Speed is not the issue. I'm also used to GLM speeds. Having parallel work open, keeping an eye on multiple terminals is just a fact of life now; work needs to optimize itself (organizationally) for parallel workflows if it wants agentic productivity from us.

I have enormous respect for Codex, and think it (by signficiant measure) has the best ability to code. In some ways I think maybe some of the reason it's so good is because it's not trying to convey complex dimensional exploration into a understandable human thought sequence. But I resent how you just have to let it work, before you have a chance to talk with it and intervene. Even when discussing it is extremely extremely terse, and I find I have to ask it again and again and again to expand.

The one caveat i'll add, I've been dabbling elsewhere but mainly i use OpenCode and it's prompt is pretty extensive and may me part of why codex feels like an ice man to me. https://github.com/anomalyco/opencode/blob/dev/packages/open...

psandor 89 days ago

“every single direction change I suggest, codex is all: "that's a great idea, and we should totally go that [very different] direction", try as I might to get it to act like more of a peer.“

That’s not the model, that’s a personality setting you can change in the codex config file.

Set it to Pragmatic, and ask it (not command it) about your new direction in planning mode.

It will tell you if your idea is not good for the given project. It’s an excellent peer.

pscanf 89 days ago

> I've had such the opposite experience

Yeah, I've actually heard many other people swear by the GPTs / Codex. I wonder what factors make one "click" with a model and not with another.

> Codex is an ice man.

That might be because OpenAI hides the actual reasoning traces, showing just a summary (if I understood correctly).

kevinsync 89 days ago

OpenClaw guy (he's Austrian, it's relevant) much prefers Codex over Claude and articulated it as being due to Claude's output feeling very "American" and Codex's output feeling very "German", and I personally really agree with the sentiment.

As an American, Claude feels much more natural to me, with the same overly-optimistic "move fast, break things" ethos that permeates our culture. It takes bigger swings (and misses) at harder-to-quantify concepts than Codex, cuts corners (not intentionally, but it feels like a human who's just moving too fast to see the forest for the trees in the moment), etc. Codex on the other hand feels more grounded, more prone to trying to aggregate blind spots, edge cases, and cover the request more thoroughly than Claude. It's far more pedantic and efficient, almost humorless. The dude also claimed that most of the Codex team is European while Claude team is American, and suggested that as an influence on why this might be.

Anyways, I've found that if I force Claude and Codex to talk to each other, I can get way better results and consistency by using Claude to generate fairly good plans from my detailed requests that it passes to Codex for review and amendment, Claude incorporates the feedback and implements the code, then Codex reviews the commit and patches anything Claude misses. Best of both worlds. YMMV

pscanf 89 days ago

Oh, interesting perspective. I'm Italian, but from an Alpine valley not far from Austria, so I don't know what I should prefer. :D

But joking aside, putting it like that I'd think I'd prefer the German/Codex way of doing things, yet I'm in camp Claude. But I've always worked better with teammates that balance my fastidiousness, so maybe that's my answer.

aragonite 89 days ago

Claude Code now hides thinking as well unless you turn on an undocumented setting:

https://github.com/anthropics/claude-code/issues/31326#issue...

https://x.com/nummanali/status/2032451025500528687

thanhhaimai 89 days ago

Opinions are my own.

For agentic work, both Gemini 3.1 and Opus 4.6 passed the bar for me. I do prefer Opus because my SIs are tuned for that, and I don't want to rewrite them.

But ChatGPT models don't pass the bar. It seems to be trained to be conversational and role-playing. It "acts" like an agent, but it fails to keep the context to really complete the task. It's a bit tiring to always have to double check its work / results.

kraemahz 89 days ago

I find both Opus 4.6 and GPT-5.4 have weaknesses but tend to support each other. Someone described it to me jokingly as "Claude has ADHD and Codex is autistic." Claude is great at doing something until it gets done and will run for hours on a task without feedback, Codex is often the opposite: it will ask for feedback often and sometimes just stop in the middle of a task saying it's done with step 1 of 5. On the other hand, Codex is a diligent reviewer and will find even subtle bugs that Claude created in its big long-running "until its done" work mode.

CamperBob2 89 days ago

Seems like the diagnoses are backwards, in this case. Claude usually stays on task no matter what, but lately Opus 4.6 is showing signs of overuse. I never used to get overload/internal server error messages, but I've seen about a half-dozen of them today alone. And it has been prone to blowing off subtasks that I'd have expected it to resolve.

tom1337 90 days ago

Yea absolutely. I am using GPT 5.2 / 5.2 Codex with OpenCode and it just doesn't get what I am doing or looses context. Claude on the other side (via GitHub Copilot) has no problem and also discovers the repository on it's own in new sessions while I need to basically spoonfeed GPT. I also agree on the speed. Earlier today I tasked GPT 5.2 Codex with a small refactor of a task in our codebase with reasoning to high and it took 20 minutes to move around 20 files.

furyofantares 90 days ago

I don't know any reason to use 5.2, when 5.3 is quite a bit faster.

spiderfarmer 90 days ago

If using OpenAI models, use the Codex desktop app, it runs circles around OpenCode.

qaz_plm 89 days ago

Can you educate me as to what makes Codex app superior using the same GPT model in both? Thx in advance!

stavros 89 days ago

Usually it's the prompts, or the model is tuned to the specific first-party tools. Sometimes that gives an edge over the generic tools, unfortunately.

spiderfarmer 89 days ago

It's the harness.

nikanj 90 days ago

Same, and I can't put my finger on the "why" either. Plus I keep hitting guard rails for the strangest reasons, like telling codex "Add code signing to this build pipeline, use the pipeline at ~/myotherproject as reference" and codex tells me "You should not copy other people's code signing keys, I can't help you with this"

renewiltord 90 days ago

Are you requesting reasoning via param? That was a mistake I was making. However with highest reasoning level I would frequently encounter cyber security violation when using agent that self-modifies.

I prefer Claude models as well or open models for this reason except that Codex subscription gets pretty hefty token space.

pscanf 89 days ago

Yes, I think? But I was talking more specifically about using the models via API in agents I develop, not for agentic coding. Though, thinking about it, I also don't click with the GPT models when I use them for coding (using Codex). They just seem "off" compared to Claude.

renewiltord 89 days ago

I am also talking about agents I'm developing. They just happen to be self-modifying but they're not for agentic coding. You have to explicitly send the reasoning effort parameter. If you set effort to None (default for gpt-5.4) you get very low intelligence.

pscanf 89 days ago

Ah OK sorry, I misinterpreted. But yes, I double checked one case and I am indeed setting the parameter explicitly (defaulting to medium effort). But no luck. It feels like the model ignores what I'm telling it.

For example, I pass it a list of database collections and tools to search through them, ask a question that can very obviously be answered with them, and it responds with "I can’t tell yet from your current records" (just tested with GPT 5.4-mini).

But I've prodded it a bit more now, and maybe the model doesn't want to answer unless it can be very very confident of the answer it produces. So it's sort of a "soft refusal".

jorl17 89 days ago

I like GPT models in Codex, for a fully vibecoded experience (I don't look at code) for my side-projects. In there, they really get the job done: you plan, they say what they'll do, and it shows up done. It's rare I need to push back and point out bugs. I really can't fault them for this very specific use-case.

For anything else, I can't stand them, and it genuinely feels like I am interacting with different models outside of codex:

- They act like terribly arrogant agents. It's just in the way they talk: self-assured, assertive. They don't say they think something, they say it is so. They don't really propose something, they say they're going to do it because it's right.

- If you counter them, their thinking traces are filled with what is virtually identical to: "I must control myself and speak plainly, this human is out of his fucking mind"

- They are slow. Measurably slow. Sonnet is so much faster. With Sonnet models, I can read every token as it comes, but it takes some focusing. With GPT, I can read the whole trace in real-time without any effort. It genuinely gives off this "dumb machine that can't follow me" vibe.

- Paradoxically, even though they are so full of themselves, they insist upon checking things which are obvious. They will say "The fix is to move this bit of code over there [it isn't]" and then immediately start looking at sort of random files to check...what exactly?

- I feel they make perhaps as many mistakes as Sonnet, but they are much less predictable mistakes. The kind that leaves me baffled. This doesn't have to be bad for code quality: Sonnet makes mistakes which _might_ at points even be _harder_ to catch, so might be easier to let slip by. Yet, it just imprints this feeling of distrust in the model which is counter-productive to make me want to come back to it

I didn't compare either with Gemini because Gemini is a joke that "does", and never says what it is "doing", except when it does so by leaving thinking traces in the middle of python code comments. Love my codebase to have "But wait, ..." in the middle of it. A useless model.

I've recently started saying this:

- Anthropic models feel like someone of that level of intelligence thinking through problems and solving them. Sonnet is not Opus -- it is sonnet-level intelligence, and shows it. It approaches problems from a sensible, reasonably predictable way.

- Gemini models feel like a cover for a bunch of inferior developers all cluelessly throwing shit at the wall and seeing what sticks -- yet, ultimately, they only show the final decision. Almost like you're paying a fraudulent agency that doesn't reveal its methods. The thinking is nonsensical and all over the place, and it does eventually achieve some of its goals, but you can't understand what little it shows other than "Running command X" and "Doing Y".

On a final note: when building agentic applications, I used to prefer GPT (a year ago), but I can't stand it now. Robotic, mechanic, constantly mis-using tools. I reach for Sonnet/Opus if I want competence and adherence to prompt, coupled with an impeccable use of tools. I reach for Gemini (mostly flash models) if I want an acceptable experience at a fraction of the price and latency.

pscanf 89 days ago

> They act like terribly arrogant agents

Oh I feel that. I sometimes ask ChatGPT for "a review, pull no punches" of something I'm writing, and my god, the answers really get on my nerves! (They do make some useful points sometimes, though.)

> On a final note: when building agentic applications, I used to prefer GPT (a year ago), but I can't stand it now. Robotic, mechanic, constantly mis-using tools. I reach for Sonnet/Opus if I want competence and adherence to prompt, coupled with an impeccable use of tools. I reach for Gemini (mostly flash models) if I want an acceptable experience at a fraction of the price and latency.

Yeah, this has been almost exactly my experience as well.

baq 89 days ago

A bit off topic, but reading your post I suddenly realized that if I read it three years ago I’d assume you’re either insane or joking. The world moved fast looking back.

birdsongs 90 days ago

> cyber security violation

Would you mind expanding on this? Do you mean in the resulting code? Or a security problem on your local machine?

I naively use models via our Copilot subscription for small coding tasks, but haven't gone too deep. So this kind of threat model is new to me.

renewiltord 89 days ago

No, I mean literal API response. They think I'm using it to hack. See related Github issue: https://github.com/anomalyco/opencode/issues/15776

I don't use OpenCode but looks like it also triggered similar use. My message was similar but different.

birdsongs 89 days ago

Ahhh okay, I see. Thanks!

hermit_dev 89 days ago

I ran 5.4 Pro on some data analytics (admittedly it was 300+ pages). It took forever. Ran the same on Sonnet 4.6, night and day difference. I understand it's like using a V8 engine for a V4 task, but I was curious. These new models look promising though. I'd rather use something like a Haiku most of the time over the best rated. I'm not a rocket scientist or solving the mysteries of the universe. They seem to do a great job 80% of the time.