Hacker News new | ask | show | jobs
by wg0 61 days ago
Who's going to review that output for accuracy? We'll leave performance and security as unnecessary luxuries in this age and time.

In my experience, even Claude 4.6's output can't be trusted blindly it'll write flawed code and would write tests that would be testing that flawed code giving false sense of confidence and accomplishment only to be revealed upon closer inspection later.

Additionally - it's age old known fact that code is always easier to write (even prior to AI) but is always tenfold difficult to read and understand (even if you were the original author yourself) so I'm not so sure this much generative output from probabilistic models would have been so flawless that nobody needs to read and understand that code.

Too good to be true.

5 comments

I am not sure how others are doing this, but here is our process:

- meaningful test coverage

- internal software architecture was explicitly baked into the prompts, and we try to not go wild with vibing, but, rather, spec it well, and keep Claude on a short leash

- each feature built was followed by a round of refactoring (with Claude, but with an oversight of an opinionated human). we spend 50% building, 50% refactoring, at least. Sometimes it feels like 30/70%. Code quality matters to us, as those codebases are large and not doing this leads to very noticeable drop in Claude's perceived 'intelligence'.

- performance tests as per usual - designed by our infra engineers, not vibed

- static code analysis, and a hierarchical system of guardrails (small claude.md + lots of files referenced there for various purposes). Not quite fond of how that works, Claude has been always very keen to ignore instructions and go his own way (see: "short leash, refactor often").

- pentests with regular human beings

The one project I mentioned - 2 months for a complete rewrite - was about a week of working on the code and almost 2 months spent on reviews, tests, and of course some of that time was wasted as we were doing this for the first time for such a large codebase. The rewritten app is doing fine in production for a while now.

I can only compare the outputs to the quality of the outputs of our regular engineering teams. It compares fine vs. good dev teams, IMHO.

The part about refactoring is very interesting and reassuring. I sometimes think I'm holding it wrong when I end up refactoring most of the agent's code towards our "opinionated" style, even after laying it out in md files. Thank you very much for this insight.
Thanks! In our limited experience, Claude does not focus that much on guardrails and code quality when building a feature - but can be pretty focused on code quality and architecture when asked to do just that. So, one a few hours to iterate a feature, a few hours to refactor. Rinse and repeat.
Very nice insight, that’s where the value is, even with a lot of time refactoring, testing and reviewing the compressed code phase is so much gziped than it’s still worth it to use an imperfect LLM. Even with humans we have all those post phases so great structure around the code generation leads to a lot of gains. It depends on industries and what’s being developed for sure
I don't want to defend LLM written code, but this is true regardless if code is written by a person or a machine. There are engineers that will put the time to learn and optimize their code for performance and focus on security and there are others that won't. That has nothing to do with AI writing code. There is a reason why most software is so buggy and all software has identified security vulnerabilities, regardless of who wrote it.

I remember how website security was before frameworks like Django and ROR added default security features. I think we will see something similar with coding agents, that just will run skills/checks/mcps/... that focus have performance, security, resource management, ... built in.

I have done this myself. For all apps I build I have linters, static code analyzers, etc running at the end of each session. It's cheapest default in a very strict mode. Cleans up most of the obvious stuff almost for free.

> For all apps I build I have linters, static code analyzers, etc running at the end of each session.

I think this is critically underrated. At least in the typescript world, linters are seen as kind of a joke (oh you used tabs instead of spaces) but it can definitely prevent bugs if you spend some time even vibe coding some basic code smell rules (exhaustive deps in React hooks is one such thing).

Well it's all tradeoffs, right? 6 months for 9 FTEs is 54 man months. 2 months for 2 FTEs is 4 man months. Even if one FTE spent two extra months perusing every line of code and reviewing, that's still 6 man months, resulting in almost 10x speed.

Let's say you dont review. Those two extra months probably turns into four extra months of finding bugs and stuff. Still 8 man months vs 54.

Of course this is all assuming that the original estimates were correct. IME building stuff using AI in greenfield projects is gold. But using AI in brownfield projects is only useful if you primarily use AI to chat to your codebase and to make specific scoped changes, and not actually make large changes.

I do greenfield in fluid dynamics and Claude doesn't help: I need to be able to justify each line of my code (the physics part) and using Claude doesn't help.

On the UI side Claude helps a lot. So for me I'd say I have a 25% productivity increment. I work like this: I put the main architecture of the code in place by hand, to get a "feel" for it. Once that is done, I ask Claude to make incremental changes, review them. Very often, Claude does an OK job.

What I have hard times with is to have Claude automatically understand my class architectures: more often than not it tries to guess information about objects in the app by querying the GUI instead of the data model. Odd.

My observation is so far, LLMs are not good at scientific computing.
You write the tests then it has a source of truth to know when it’s not working.
Minor point: AI doesn’t write, it generates.