Hacker News
by cedws 130 days ago
I’m finding that the code LLMs produce is just average. Not great, not terrible. Which makes sense, the model is basically a complex representation of the average of its training data right? If I want what I consider ‘good code’ I have to steer it.

So I wouldn’t use LLMs to produce significant chunks of code for something I care about. And publishing vibe coded projects under my own GitHub user feels like it devalues my own work, so for now I’m just not publishing vibe coded projects. Maybe I will eventually, under a ‘pen name.’

3 comments

We've gone from "it's glorified auto-complete" to "the quality of working, end-to-end features, is average", in just ~2 years.

I think it goes without saying that they will be writing "good code" in short time.

I also wonder how much of this "I don't trust them yet" viewpoint is coming from people who are using agents the least.

Is it rare that AI one-shots code that I would be willing to raise as a PR with my name on it? Yes, extremely so (almost never).

Can I write a more-specified prompt that improves the AI's output? Also yes. And the amount of time/effort I spend iterating on a prompt, to shape the feature I want, is decreasing as I learn to use the tools better.

I think the term prompt-engineering became loaded to mean "folks who can write very good one-shot prompts". But that's a silly way of thinking about it imo. Any feature with moderate complexity involves discovery. "Prompt iteration" is more descriptive/accurate imo.

First you have to define what “good code” is, something that programmers have still not settled on in the more than half a century that the field has existed. I also think what the other reply said is true: going from average to “good code” is way harder because it implies a need for LLMs to self-critique beyond what they do today. I don’t think just training on a set of hand-picked samples is enough.

There’s also the knowledge cutoff aspect. I’ve found that LLMs often produce outdated Go code that doesn’t utilise the modern language features. Or for cases where it knows about a commonly used library, it uses deprecated methods. RAG/MCP can kind of paper over this problem but it’s still fundamental to LLMs until we have some kind of continuous training.

AIs can self-critique via mechanisms like chain of thought, or via user-specified guard rails such as a hook that requires the test suite to pass before a task can be considered complete/ready for human review. These can and do result in higher quality code.

Agree that "good code" is vague - it probably always will be. But we can still agree that code quality is going up over time without having a complete specification for what defines "good".

Unfortunately I can only give anecdotes, but in my experience the LLM's 'thinking' does not lead to code quality improvements in the same way that a programmer thinking for a while would.

In my experience having LLMs write Go, they tend to factor code in a not-so-great way from the start, probably because they lack a mental model of how the pieces compose together. Furthermore, once a structure is in place, there doesn't seem to be a trigger point that causes the LLM to step back and think about reorganising the code, or how the code it wants to write could be better integrated into what's already there. It tends to be very biased by the structures that already exist and not really question them.

A programmer might write a function, notice it becoming too long or doing too much, and then decide to break it down into smaller subroutines. I've never seen an LLM really do this; they seem biased towards being additive.

I believe good code comes from an intuition which is very hard to convey. Imprinting hard rules into the LLM like 'refactor long functions' will probably just lead to overcorrection and poor results. It needs to build its own taste for good code, and I'm not sure if that's possible with current technology.

> Furthermore, once a structure is in place, there doesn't seem to be a trigger point that causes the LLM to step back and think about reorganising the code, or how the code it wants to write could be better integrated into what's already there.

Older models did do this, and it sucked. You'd ask for a change to your codebase and they would refactor a chunk of it and make a bunch of other unrelated "improvements" at the same time.

This was frustrating and made for code that was harder to review.

The latest generation of models appears to have been trained not to do that. You ask for a feature, and they'll build that feature with the fewest possible changes to the code.

I much prefer this. If I want the code refactored I'll say to the model "look for opportunities to refactor this" and then it will start suggesting larger changes.

> A programmer might write a function, notice it becoming too long or doing too much, and then decide to break it down into smaller subroutines. I've never seen an LLM really do this, they seem biased towards being additive.

The nice thing is a programmer with an LLM just steps in here, and course-corrects, and still has that value add, without taking all the time to write the boilerplate in between.

And in general, the cleaner your codebase, the cleaner LLM modifications will be; it does pick up on coding style.

> The nice thing is a programmer with an LLM just steps in here, and course-corrects

This does not seem to be the direction things are going. People are talking about shipping code they haven't edited, most notably the author of Claude Code. Sometimes they haven't even read the code at all. With LLMs the path of least resistance is to take your hands off the wheel completely. Only programmers taking particular care are still playing an editorial role.

When the code is constructed by an LLM, the human in the driving seat doesn't get a chance to build the mental models that they usually would when writing it manually. This stifles the ability to see opportunities to refactor. It is widely considered to be harder to read code than to write it.

> And in general, the cleaner your codebase the cleaner LLM modifications will be

Whilst true, this is a kind of "you're holding it wrong" argument. If LLMs had a model of what differentiates good code from bad code, whatever they pull into their context should make no difference.

Building expertise isn't a straight line. Going from bad to average is much easier than going from average to good.

Yeah, Tesla and Waymo know this quite well. There's a reason we don't have moon bases yet.

Isn't it more likely that we are 80% of the way to maximum performance after doing 20% of the work, and that the remaining small performance increase will require a multiple of the work done so far and still leave us with performance that "isn't good enough"? That seems far more likely to me than a linear progression to AGI from here.

Is there a big enough dataset of 'good' code to train from though?

I (and lots of people) used to think the models would run out of training data and it would halt progress.

They did run out of human-authored training data (depending on who you ask), in 2024/2025. And they still improve.

> They did run out of human-authored training data (depending on who you ask), in 2024/2025. And they still improve.

It seemed to me that improvements due to training (i.e. the model) in 2025 were marginal. The biggest gains were in structuring how the conversation with the LLM goes.

> And they still improve.

But what asymptote are they approaching? Average code? Good code? Great code?

I'd argue that "good", or at least "good enough", is when they reach a point where it becomes preferable to spend your time prompting rather than reading and writing code. That the final output meets the feature specifications is more or less the goal.

A lot of developers are having a difficult time accepting that the code doesn't matter nearly as much anymore, myself included. The feedback cycles that made hot fixing, bug fixing, customer support, etc. so expensive, have shrunk by orders of magnitude. A codebase that can be maintained by humans is perhaps not a goal worth pursuing anymore.

To really see this and feel this, I think it's worthwhile to spend at least a weekend or two seeing what you can build without writing or reviewing any of the code. Use a frontier model. Opus 4.6 or Codex 5.3. Probably doesn't matter which one you choose.

If you give it an honest try, you'll see that a lot of the limitations are self-imposed. Said another way: the root problem is some flavor of the user under-specifying a prompt, having inconsistent design docs, or not implementing guard rails to prevent the AI from reintroducing bugs you previously squashed.

It's a very new way of working and it feels foreign. But there are a lot of very smart, very successful people doing this. People who have written millions of lines of code over their lifetime, and who enjoyed doing it, are now fully delegating the task.

They ran out of passively collected data. RLHF allows them to gather deeper, more targeted data.

There is a lot of RLHF effort around this.

AHEM

Let me repeat myself.

I think it goes without saying that they will be writing "good code" in short time.

I think you're kind of missing the point.

Think about it from a resource (calorie) expenditure standpoint.

Are you expending more resources on writing the prompts vs. just doing the work without them? That's the real question.

If you are expending more, which is what Simon is hinting at, are you really better off? I'd argue not, given that this can't be sustained for hours on end. Yet the expectation from management might be that you should be able to sustain this for 8 hours.

So again, are you better off? Not in the slightest.

Many things in life are counter-intuitive and not so simple.

P.S. You're not getting paid more for increasing productivity if you are still expected to work 8 hrs a day... lmao. Thankfully I'm not a SWE.

I think something a lot of people miss is that we're not all the same. We all have different internal thought models, whether because of biological differences (ADHD brain?), educational differences, or overall abilities. And it seems a lot of people have this idea that everyone uses "AI" the same way. That's a lack of lateral thinking. Assuming we're all burning "calories" in the same way implies we all think, and work, alike.

We are not alike.

I don't think I'm missing the point and respectfully, I think your reply is completely unrelated to anything that I said.

Whether you are "better off or not" is a separate topic, and I never suggested one way or the other.

Simon's point is that engineers can be so productive with these tools that it is tempting to work (much) longer.

Simon: "I'm frequently finding myself with work on two or three projects running parallel. I can get so much done, but after just an hour or two my mental energy for the day feels almost entirely depleted."

You're a time waster, stop posting and creating noise.

Time wasting would be not reading the comment I replied to, and then thinking I was replying to Simon/the article.

Does that sound familiar?

People often describe the models as averaging their training data, but even for base models predicting the most likely next token this is imprecise and even misleading, because what is most likely is conditional on the input as well as what has been generated so far. So a strange input will produce a strange output — hardly an average or a reversion to the mean.
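
A toy illustration of that conditioning point, assuming a tiny made-up corpus and a simple bigram count model (nothing like a real LLM, and the names `corpus`, `unigram`, `bigram`, and `most_likely_next` are invented for this sketch): the most common token overall is not what gets predicted after an unusual context.

```python
from collections import Counter, defaultdict

# Made-up training text, already tokenized on whitespace.
corpus = (
    "the cat sat on the mat . the cat ate the fish . "
    "a strange zebra ate a strange zebra ."
).split()

# Unconditional counts: the closest thing here to an "average" of the data.
unigram = Counter(corpus)

# Conditional counts: for each token, which tokens follow it and how often.
bigram = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev][nxt] += 1


def most_likely_next(prev: str) -> str:
    """Return the token that most often follows `prev` in the corpus."""
    return bigram[prev].most_common(1)[0][0]


if __name__ == "__main__":
    print(unigram.most_common(1)[0][0])   # the
    print(most_likely_next("the"))        # cat
    print(most_likely_next("strange"))    # zebra
```

Even in this crude model, a rare context ("strange") yields a rare continuation ("zebra") rather than the globally most frequent token ("the").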

On top of that, the models people use have been heavily shaped by reinforcement learning, which rewards something quite different from the most likely next token. So I don’t think it’s clarifying to say “the model is basically a complex representation of the average of its training data.”

The average thing points to the real phenomenon of underspecified inputs leading to generic outputs, but modern agentic coding tools don’t have this problem the way the chat UIs did because they can take arbitrary input from the codebase.

I was literally creating a 2nd account on GitHub today for this purpose.