Hacker News
by rybosworld 131 days ago
We've gone from "it's glorified auto-complete" to "the quality of working, end-to-end features, is average", in just ~2 years.

I think it goes without saying that they will be writing "good code" in short time.

I also wonder how much of this "I don't trust them yet" viewpoint is coming from people who are using agents the least.

Is it rare that AI one-shots code that I would be willing to raise as a PR with my name on it? Yes, extremely so (almost never).

Can I write a more-specified prompt that improves the AI's output? Also yes. And the amount of time/effort I spend iterating on a prompt, to shape the feature I want, is decreasing as I learn to use the tools better.

I think the term prompt-engineering became loaded to mean "folks who can write very good one-shot prompts". But that's a silly way of thinking about it imo. Any feature with moderate complexity involves discovery. "Prompt iteration" is more descriptive/accurate imo.

5 comments

First you have to define what “good code” is, something that programmers have still not settled on in the more than half a century that the field has existed. I also think what the other reply said is true: going from average to “good code” is far harder, because it implies a need for LLMs to self-critique beyond what they do today. I don’t think just training on a set of hand-picked samples is enough.

There’s also the knowledge cutoff aspect. I’ve found that LLMs often produce outdated Go code that doesn’t utilise the modern language features. Or for cases where it knows about a commonly used library, it uses deprecated methods. RAG/MCP can kind of paper over this problem but it’s still fundamental to LLMs until we have some kind of continuous training.

AIs can self-critique via mechanisms like chain of thought, or via user-specified guardrails like a hook that requires the test suite to pass before a task can be considered complete/ready for human review. These can and do result in higher-quality code.

Agree that "good code" is vague; it probably always will be. But we can still agree that code quality is going up over time without having a complete specification for what defines "good".

Unfortunately I can only give anecdotes, but in my experience the LLM's 'thinking' does not lead to code quality improvements in the same way that a programmer thinking for a while would.

In my experience having LLMs write Go, they tend to factor code in a not-so-great way from the start, probably due to lacking a mental model of how the pieces compose together. Furthermore, once a structure is in place, there doesn't seem to be a trigger point that causes the LLM to step back and think about reorganising the code, or how the code it wants to write could be better integrated into what's already there. It tends to be very biased by the structures that already exist and not really question them.

A programmer might write a function, notice it becoming too long or doing too much, and then decide to break it down into smaller subroutines. I've never seen an LLM really do this; they seem biased towards being additive.

I believe good code comes from an intuition which is very hard to convey. Imprinting hard rules into the LLM like 'refactor long functions' will probably just lead to overcorrection and poor results. It needs to build its own taste for good code, and I'm not sure if that's possible with current technology.

> Furthermore, once a structure is in place, there doesn't seem to be a trigger point that causes the LLM to step back and think about reorganising the code, or how the code it wants to write could be better integrated into what's already there.

Older models did do this, and it sucked. You'd ask for a change to your codebase and they would refactor a chunk of it and make a bunch of other unrelated "improvements" at the same time.

This was frustrating and made for code that was harder to review.

The latest generation of models appears to have been trained not to do that. You ask for a feature, and they'll build that feature with the fewest changes possible to the code.

I much prefer this. If I want the code refactored I'll say to the model "look for opportunities to refactor this" and then it will start suggesting larger changes.

> A programmer might write a function, notice it becoming too long or doing too much, and then decide to break it down into smaller subroutines. I've never seen an LLM really do this, they seem biased towards being additive.

The nice thing is a programmer with an LLM just steps in here, and course-corrects, and still has that value add, without taking all the time to write the boilerplate in between.

And in general, the cleaner your codebase, the cleaner LLM modifications will be; it does pick up on coding style.

>The nice thing is a programmer with an LLM just steps in here, and course-corrects

This does not seem to be the direction things are going. People are talking about shipping code they haven't edited, most notably the author of Claude Code. Sometimes they haven't even read the code at all. With LLMs the path of least resistance is to take your hands off the wheel completely. Only programmers taking particular care are still playing an editorial role.

When the code is constructed by an LLM, the human in the driving seat doesn't get a chance to build the mental models that they usually would writing it manually. This stifles the ability to see opportunities to refactor. It is widely considered to be harder to read code than to write it.

>And in general, the cleaner your codebase the cleaner LLM modifications will be

Whilst true, this is a kind of "you're holding it wrong" argument. If LLMs had a model of what differentiates good code from bad code, whatever they pull into their context should make no difference.

> Whilst true, this is a kind of "you're holding it wrong" argument. If LLMs had a model of what differentiates good code from bad code, whatever they pull into their context should make no difference.

Good code is in the eye of the beholder. What reviewers in one shop would consider good code is dramatically different than another.

Conforming to the existing codebase style is good in and of itself; if the context it pulls in made no difference, that would make it useless.

> When the code is constructed by an LLM, the human in the driving seat doesn't get a chance to build the mental models that they usually would writing it manually

I'm asking the LLM for alternatives and options constantly, to test different models. It can give me a write-up description of options, or go spin up subagents to go try 4 different things at once.

> It is widely considered to be harder to read code than to write it

Even more than writing code, I think LLMs are exceptional at reading code. They can review huge amounts of code incredibly fast, to understand very complex systems. And then you can just ask them questions! Don't understand? Ask more questions!

I have mcp-neovim-server open, so I just ask it to open the relevant pieces of code at those lines, and it can then show me. CodeCompanion makes it easy to ask questions about a line. It's amazing how

Reading code was one of the extremely hard parts of programming, and the machine is far far better at it than us!

> When the code is constructed by an LLM, the human in the driving seat doesn't get a chance to build the mental models that they usually would writing it manually.

Here's one way to tell me you haven't tried the thing without saying you haven't tried the thing. The ability to do deep inquiry into topics and to test and try different models is far, far better than it has ever been. We aren't stuck with what we write; we can keep iterating and trying at vastly lower cost, to do the hard work of discovering what a good model is. Programmers have rarely had the luxury of time and space to keep working on a problem again and again, to adjust and change and tweak until the architecture truly sings. Now you can try a week's worth of architectures in an afternoon. There is no better time for those who want to understand to do so.

I feel like one thing missing from this thread is that most people adopting AI at a serious level are building really strong AGENTS.md files that refine tastes, practices, and forms. The AI is pretty tasteless; it isn't deliberate. It is up to us to explore the possibility space when working on problems, and to create good context that steers towards good solutions. And our ability to get information out, to probe into systems, to assess, to test hypotheses, is vastly higher, which we can keep using to become far better steersfolk.

Building expertise isn't a straight line. Going from bad to average is much easier than going from average to good.

Yeah, Tesla and Waymo know this quite well. There's a reason we don't have moon bases yet.

Isn't it more likely that we are 80% of the way to maximum performance after doing 20% of the work, and that the remaining tiny performance increase will require a multiple of the work we have done so far and still leave us with performance that "isn't good enough"? That seems far more likely to me than a linear progression to AGI from here.

Is there a big enough dataset of 'good' code to train from, though?

I (and lots of people) used to think the models would run out of training data and that this would halt progress.

They did run out of human-authored training data (depending on who you ask), in 2024/2025. And they still improve.

> They did run out of human-authored training data (depending on who you ask), in 2024/2025. And they still improve.

It seemed to me that improvements due to training (i.e. the model) in 2025 were marginal. The biggest gains were in structuring how the conversation with the LLM goes.

> And they still improve.

But what asymptote are they approaching? Average code? Good code? Great code?

I'd argue that "good", or at least "good enough", is when they reach a point where it becomes preferable to spend your time prompting rather than reading and writing code. That the final output meets the feature specifications is more or less the goal.

A lot of developers are having a difficult time accepting that the code doesn't matter nearly as much anymore, myself included. The feedback cycles that made hot fixing, bug fixing, customer support, etc. so expensive, have shrunk by orders of magnitude. A codebase that can be maintained by humans is perhaps not a goal worth pursuing anymore.

To really see this and feel this, I think it's worthwhile to spend at least a weekend or two seeing what you can build without writing or reviewing any of the code. Use a frontier model. Opus 4.6 or Codex 5.3. Probably doesn't matter which one you choose.

If you give it an honest try, you'll see that a lot of the limitations are self-imposed. Said another way: the root problem is some flavor of the user underspecifying a prompt, having inconsistent design docs, or not implementing guardrails to prevent the AI from reintroducing bugs you previously squashed.

It's a very new way of working and it feels foreign. But there are a lot of very smart, very successful people doing this. People who have written millions of lines of code over their lifetime, and who enjoyed doing it, are now fully delegating the task.

They ran out of passively collected data. RLHF allows them to gather deeper, more targeted data.

There is a lot of RLHF effort around this.

AHEM

Let me repeat myself.

I think it goes without saying that they will be writing "good code" in short time.

I think you're kind of missing the point.

Think about it from a resource (calorie) expenditure standpoint.

Are you expending more resources on writing the prompts than you would by just doing the work without them? That's the real question.

If you are expending more, which is what Simon is pointing at, are you really better off? I'd argue not, given that this can't be sustained for hours on end. Yet the expectation from management might be that you should be able to sustain it for 8 hours.

So again, are you better off? Not in the slightest.

Many things in life are counter-intuitive and not so simple.

P.S. You're not getting paid more for increasing productivity if you are still expected to work 8 hrs a day... lmao. Thankfully I'm not a SWE.

I think something a lot of people miss is that we're not all the same. We all have different internal thought models, whether from biological differences (ADHD brain?), educational differences, or overall abilities. And it seems a lot of people have this idea that everyone uses "AI" the same way. That's a lack of lateral thinking. Assuming we're all burning "calories" in the same way implies we all think, and work, alike.

We are not alike.

I don't think I'm missing the point and respectfully, I think your reply is completely unrelated to anything that I said.

Whether you are "better off or not" is a separate topic, and I never suggested one way or the other.

Simon's point is that engineers can be so productive with these tools that it is tempting to work (much) longer.

Simon: "I'm frequently finding myself with work on two or three projects running parallel. I can get so much done, but after just an hour or two my mental energy for the day feels almost entirely depleted."

You're a time waster; stop posting and creating noise.

Time wasting would be not reading the comment I replied to, and then thinking I was replying to Simon/the article.

Does that sound familiar?