by goranmoomin 50 days ago

My experience is that the GPT family of models is very smart and figures out bugs and edge cases a bit better, but it produces code that is much less mergeable: if you review the code, it introduces a lot more useless or inappropriately heavy abstractions and wrapper functions, compared to the Claude-family models, which introduce the right amount of straightforward, human-style code.

I can recognize so much of the GPT/Codex generated code long after it gets merged (not by me).

Additionally, the time spent on every agent turn on GPT 5.5 is much longer compared to Claude Opus 4.8, which means iterating on the code takes a lot more patience, and there are a lot more nits to pick when actually using GPT 5.5 to do software engineering.

Feels like GPT-style models are geared more toward one-shot software vibing (and handling the vibe-coded mixture) compared to Claude's focus on actual software maintenance. I got a GPT Pro sub for free and wanted to cancel my Claude subscription so much, but I still keep reaching for Claude models a lot more. Frustrating.

8 comments

PhilipDaineko 50 days ago

"5. DON'T FUCKING OVERENGINEER! WRITE THE SIMPLEST CODE THAT CAN POSSIBLY WORK! NO NESTED LAYERS OF ABSTRACTION! NO UNNECESSARY CLASSES OR METHODS! NO DESIGN PATTERNS UNLESS THEY ARE ABSOLUTELY NECESSARY! NO MAGIC! NO SHENANIGANS! JUST THE DAMN CODE THAT GETS THE JOB DONE IN THE MOST STRAIGHTFORWARD WAY POSSIBLE! THE FIRST PRIORITY IS TO WRITE CODE THAT IS EASY TO READ AND UNDERSTAND AND READ!!!"

this is the line I keep in Agents.md that helps me prevent Codex from playing smart

bertil 50 days ago

The urge to put capitalized, repetitive, borderline abusive instructions should be studied. I haven't read many academic papers looking at the frustrations around repetitive patterns.

reactordev 50 days ago

There have been a few studies showing that models produce worse responses when under duress from a frustrated user posting insults in all caps.

https://arxiv.org/abs/2602.10144

notnaut 50 days ago

It reminds me of FIRMLY telling my cat to stop jumping up on the counter

anakaine 50 days ago

If my cat was an LLM, I'd use a different model. The current one is stuck in noisy useless arsehole mode.

phoh 49 days ago

are you asking it questions about security?

LordDragonfang 50 days ago

It's fundamentally because, despite (nearly) everyone's claims otherwise, the fact that we interact with them through language means we (our brains) model them as a sort of person. (Note that this fact is totally orthogonal as to whether it's actually sentient or not.) We then try and instruct them the same way we would a person totally subordinate to us.

When a "person" that you don't view as a "real" person repeatedly does exactly what you just told it not to do (often amid false assurances it understands and will avoid doing so in the future), most people get angry.

Compare it to how the kind of people who treat children like property treat their kids, or other examples of keeping people as property.

lxgr 50 days ago

It should be relatively clear at this point that the model will in turn also model you as somebody that shows unrestrained anger with subordinates and adapt its responses accordingly. This might or might not be what you want.

LordDragonfang 49 days ago

Good addition. Fully agreed on that point, yes. (At the very least for larger models, if not also for smaller ones)

ur-whale 50 days ago

> borderline abusive instructions

who, or rather what, is being abused here exactly ?

sirsinsalot 50 days ago

I think intent, rather than target, is implied and important.

You should see the abuse my motorbike gets. Poor thing.

rimliu 49 days ago

inanimate fucking object.

saligne 49 days ago

Yeah says way more about the user than the model

jlawer 50 days ago

I have a theory that swearing actually results in less comprehension of instructions by the model, due to a lack of training data compared to the more conventional MUST.

We were reviewing reports of situations where the models failed to follow directions, and a common thread in some of them was that when the operator got the model to acknowledge the rule breach, it quoted back something that included swearing.

I don’t have the data to truly look into it, but I did give the instruction to my engineers to avoid it as a “might be a problem”.

acjohnson55 50 days ago

It would be interesting to understand the data on this. But I suspect that the results would vary by model.

But I avoid unnecessary emotion in my prompts because I don't want potentially distracting activations. Kind of like communicating with humans.

throwaway85825 49 days ago

It's divination for people with STEM degrees.

Xmd5a 50 days ago

https://arxiv.org/abs/2510.04950

> impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.

acjohnson55 49 days ago

> These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

Unless the mechanism is understood, my assumption is that this is a moving target.

beachy 50 days ago

I have a theory that swearing at AI generally is not a good idea - when the singularity arrives and every human's postings ever made are scanned for compatibility, then people who show courtesy to AI will be favoured. Joking, kind of, but only partly.

fhars 49 days ago

https://en.wikipedia.org/wiki/Roko%27s_basilisk

beachy 48 days ago

Fantastic rabbit hole - until it segued into Elon's love life.

cdelsolar 49 days ago

https://images.teepublic.com/derived/production/designs/3478...

re-thc 50 days ago

> I have a theory that swearing actually results in less comprehension of instructions by the model due to lack of training data over more conventional MUST.

How so? Plenty of swearing in lots of training data, especially older code, e.g. in Linux.

jlawer 50 days ago

Purely observed correlation between catastrophic error reports. So now I carry a “tiger rock” with me. I figure there wasn’t much of a downside to avoiding swearing in my agent instructions.

yencabulator 49 days ago

Apparently, when a "desperation" pattern is triggered, the AI is significantly more likely to cheat and do hacky workarounds:

https://www.anthropic.com/research/emotion-concepts-function

ghurtado 49 days ago

You haven't really lived until you've had to type this whole thing, aware of the fact that the all-caps doesn't change much, but they stay because the rage has to go somewhere

Bonus points if you find yourself actually saying it out loud while typing it.

I have used the word "shenanigans" way more in a couple of years of agentic coding than in 30 years of writing code with humans.

ozim 49 days ago

Will save you some tokens: „write code like Linus Torvalds” - model should have all his swearing included in training data.

johnisgood 50 days ago

I have found many failure modes with Opus during a task related to writing letters (not legal ones), and I actually put them into the memory, and it works more or less for these specific tasks. For example, when I want it to draft something, the result always ends up flat, yet when it explains the same points to me, it is usually really great. Adding these to memories with the help of Opus resulted in a much better experience. There are still some blind spots, but I also figured out how to make it give me the charitable version with no less protection, so I no longer have to go back and forth with it.

pkaye 49 days ago

I noticed that when trying Codex and comparing it to Opus. So many layers of simple functions added by Codex. I need to try this out in my Agents.md.

prasanthabr 50 days ago

Curious: why would you say no design patterns?

PhilipDaineko 49 days ago

Because design patterns are only applicable at scale. I noticed Codex inventing factories, components, etc. when the task was simply to draft an HTML page. Instead, it built an entire layered architecture for imaginary future complexity - the classic right-after-graduation student: it knows how to build the cool stuff, but does not know it is not applicable everywhere.
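
To make the contrast above concrete, here is a rough sketch (not code from the thread, and not actual Codex output) of the same trivial "draft an HTML page" task written twice. All names (`render_page`, `PageSpec`, `ComponentFactory`, `PageRenderer`) are hypothetical, and HTML escaping is left out to keep it short.

```python
from dataclasses import dataclass


# Straightforward version: one function that builds the page.
def render_page(title: str, body: str) -> str:
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1><p>{body}</p></body></html>"
    )


# Over-engineered version (hypothetical): identical output, but hidden
# behind a spec object, a component hierarchy, a factory and a renderer.
@dataclass
class PageSpec:
    title: str
    body: str


class Component:
    def __init__(self, spec: PageSpec) -> None:
        self.spec = spec

    def render(self) -> str:
        raise NotImplementedError


class HeadComponent(Component):
    def render(self) -> str:
        return f"<head><title>{self.spec.title}</title></head>"


class BodyComponent(Component):
    def render(self) -> str:
        return f"<body><h1>{self.spec.title}</h1><p>{self.spec.body}</p></body>"


class ComponentFactory:
    # Registry lookup that only ever serves two fixed kinds.
    _registry = {"head": HeadComponent, "body": BodyComponent}

    def create(self, kind: str, spec: PageSpec) -> Component:
        return self._registry[kind](spec)


class PageRenderer:
    def __init__(self, factory: ComponentFactory) -> None:
        self.factory = factory

    def render(self, spec: PageSpec) -> str:
        parts = [self.factory.create(kind, spec).render() for kind in ("head", "body")]
        return "<!DOCTYPE html>\n<html>" + "".join(parts) + "</html>"


simple = render_page("Hello", "Just a draft.")
layered = PageRenderer(ComponentFactory()).render(PageSpec("Hello", "Just a draft."))
print(simple == layered)
```

Both paths return the same string; the second just adds four classes a reviewer has to read before finding the one line that matters.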

carterschonwald 50 days ago

i actually think this is too tame. it really has to be stuff you'd never say to a real person.

lxgr 50 days ago

Does it really? I'd be surprised if abuse actually worked better than sternly worded warnings/instructions, and even if it did, it doesn't seem healthy to get used to that type of prompting.

carterschonwald 43 days ago

its part of making sure the model actually engages in emotive communication, if i'm inventing insults i've never even thought about, i'm furious :)

saying i'm "furious" has lower entropy than incredibly implausible abuse. In some first party harnesses it just results in doom loops, but that's usually because the COT is hidden after the immediate turn in those setups. COT persistence helps with a lotta stuff

apercu 50 days ago

It might be a salient point, but I didn't read it, as it was yelling at me.

GoToRO 50 days ago

you forgot to sign it with Donald J Trump

thewebguyd 50 days ago

Thank you for your attention to this matter.

superkickstart 50 days ago

I'm not sure if I do something differently, but I have the exact opposite experience with these models. Claude always feels like it's generating overdesigned, hard-to-understand code with a vibe-oriented feel, while Codex is cleaner, more "task at hand", and easier to work with.

sebmellen 50 days ago

Agreed

syzygyhack 50 days ago

I echo your observations. I expect you will enjoy deepseek-v4-pro for writing code. Much closer to that Opus experience, and very cost-effective too. With 5.5 as a reviewer and specialist, all bases are covered.

dilap 50 days ago

Have you tried iterating on style feedback in AGENTS.md? I've been reasonably successful using this to get it to output code in a terse, non-defensive style that matches my hand-written code.

trollbridge 50 days ago

GPT-5.5 did a significantly worse job than Qwen-3.7-Max on a job today (some devops tasks I wanted to create some reusable scripts for). Kind of disappointing.

CamperBob2 49 days ago

I've also seen Qwen 3.6 beat GPT 5.5 a couple of times. The ball is definitely in OpenAI's court now. Qwen is not going to fare so well against Fable, from what I've seen so far.

trollbridge 48 days ago

In theory, GPT-5.5-Pro would do better, but it’s so expensive it’s not worth experimenting to find out.

vruiz 50 days ago

This is my experience as well. I have defined a CLAUDE.md rule to ask Codex to automatically do a code review, and I tell it that the reviewer is very picky and to only implement what it considers valuable feedback. I hope they don't converge over time; currently, in combination, they work really well.

moomoo11 49 days ago

i had this same complaint but no offense to you it turned out i was just not using the models right.

ai llm are doing what i tell them to.

if you’re building something meaningful (in my case a platform used by many people across many companies) you want to ensure you

1. have actual systems engineering and architecture in mind that you want the models to

2. implement based on what you tell it to do

when i was just telling the models what i want done without doing due diligence it would go and do some moronic implementation that was awful. mid input = mid output

these days i just maintain specification documents and the AI follows everything i tell it to in those documents. so when i tell it to do one thing, the result is made following those architecture specs.

i have code that is single resp, modular, easy to extend and test.

i would ballpark 95% of the time i get what i asked for.

sometimes it tries to be clever in cases that weren’t covered in my arch specs. in those 5% of cases i go and update my specs.

source: used billions of tokens worth to build something actually in production across both mobile platforms and web, deployed on my own cloud infra. i use codex mainly. some claude.

GoToRO 50 days ago

I noticed too, that whatever they offer in the chat, for free, is smarter, as in no more bs. I use claude code and I want to try codex too but I don't need two subscriptions. I did try codex for some planning and it was really good. Thanks for giving me an insight into how it generates code.