by wg0
61 days ago
Who's going to review that output for accuracy? We'll leave performance and security aside as unnecessary luxuries in this day and age. In my experience, even Claude 4.6's output can't be trusted blindly: it'll write flawed code and then write tests that exercise that flawed code, giving a false sense of confidence and accomplishment that only falls apart on closer inspection later. Additionally, it's an age-old fact that code has always been easier to write (even before AI) and tenfold harder to read and understand (even if you were the original author), so I'm not convinced that this much generated output from probabilistic models is so flawless that nobody needs to read and understand it. Too good to be true.
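To make that failure mode concrete, here's a minimal Python sketch. Everything in it is hypothetical (the `apply_discount` function, the deliberate bug, and both test names are made up for illustration, not taken from any real codebase). The first test derives its expected value from the same flawed logic as the implementation, so it passes; the second takes its expected value from the spec, so it catches the bug.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function with a deliberate bug: it subtracts the
    percentage as an absolute amount instead of a fraction of the price."""
    return price - percent


def circular_test() -> bool:
    # The expected value is recomputed with the same flawed logic as the
    # implementation, so the test agrees with the bug and passes.
    expected = 200.0 - 10.0
    return apply_discount(200.0, 10.0) == expected


def independent_test() -> bool:
    # The expected value comes from the spec: 10% off 200 should be 180.
    # This test fails, which exposes the bug.
    return apply_discount(200.0, 10.0) == 180.0


if __name__ == "__main__":
    print("circular test passes:", circular_test())
    print("independent test passes:", independent_test())
```

A green run of the first test says nothing about correctness, which is why someone still has to read the tests against the spec and not just count them.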
|
- meaningful test coverage
- internal software architecture was explicitly baked into the prompts, and we try not to go wild with vibing but rather spec it well and keep Claude on a short leash
- each feature built was followed by a round of refactoring (with Claude, but under the oversight of an opinionated human). We spend at least 50% of the time building and 50% refactoring; sometimes it feels like 30/70. Code quality matters to us, as those codebases are large and skipping this leads to a very noticeable drop in Claude's perceived 'intelligence'.
- performance tests as per usual - designed by our infra engineers, not vibed
- static code analysis, and a hierarchical system of guardrails (a small claude.md plus lots of files referenced from it for various purposes). I'm not quite fond of how that works: Claude has always been very keen to ignore instructions and go its own way (see: "short leash, refactor often").
- pentests with regular human beings
The one project I mentioned (2 months for a complete rewrite) was about a week of working on the code and almost 2 months spent on reviews and tests, and of course some of that time was wasted, as we were doing this for the first time on such a large codebase. The rewritten app has been doing fine in production for a while now.
I can only compare the outputs to the quality of what our regular engineering teams produce. It compares fine vs. good dev teams, IMHO.