HN Mirror


by fhouser 129 days ago

I recently shipped a "vibe-coded" project. You raise a good point: I hadn't considered the confidence gap. If it is true that LLM-generated code produces more vulnerabilities, on top of there being more code overall, while the developer feels better about the results, then that is concerning.

This is how I go about ensuring there is little to no chaos (your mileage may vary based on project size and characteristics):

- Plan your project manually; do not outsource thinking to the LLM. This includes being intentional about architecture, tech stack, dependencies, etc.
- I have planning, orchestrating, coding, and reviewing agents. These should be self-explanatory, but there's a catch: the workflow is automated. OpenCode allows you to define "subagents" which can be called by "primary" agents. I write a detailed GitLab issue that my planning agent can fetch and read. It creates a detailed resolution plan that I can point the orchestration agent to. The orchestrator then delegates implementation to one or more coding agents simultaneously. Results are in turn delegated to reviewer agents. If the reviewer agents don't complain, the results are ready for human review in an MR.
- Changes that pass all reviews are documented in the project spec. E.g., if new modules are added that require an auth guard pattern that is already documented in the spec, they are listed as relevant sites for that auth guard pattern, etc.
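For readers who want a picture of that flow, here is a minimal sketch of the plan, code in parallel, review, then human MR loop. It is not OpenCode's API and not my actual config: `run_pipeline`, `Result`, and the `fake_*` agents are hypothetical stand-ins, and in a real setup each callable would wrap an agent call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Result:
    task: str
    patch: str
    complaints: List[str] = field(default_factory=list)


def run_pipeline(
    issue: str,
    plan: Callable[[str], List[str]],
    code: Callable[[str], str],
    review: Callable[[str, str], List[str]],
    max_workers: int = 4,
) -> Tuple[List[Result], bool]:
    """Plan once, code tasks in parallel, review every patch.

    Returns the results plus a flag that is True only when there is
    at least one result and no reviewer raised a complaint.
    """
    tasks = plan(issue)
    # The orchestrator hands tasks to several coding agents at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        patches = list(pool.map(code, tasks))
    results = [Result(t, p, review(t, p)) for t, p in zip(tasks, patches)]
    ready_for_mr = bool(results) and all(not r.complaints for r in results)
    return results, ready_for_mr


# Hypothetical stand-ins for the real agents, for illustration only.
def fake_plan(issue: str) -> List[str]:
    return [line.strip() for line in issue.splitlines() if line.strip()]


def fake_code(task: str) -> str:
    return f"patch for: {task}"


def fake_review(task: str, patch: str) -> List[str]:
    return [] if task in patch else ["patch does not address task"]


results, ready = run_pipeline(
    "add auth guard\nupdate spec", fake_plan, fake_code, fake_review
)
```

The flag only says the patches are ready for a human to look at in the MR. It never merges anything by itself.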

I feel like the LLM agents have been more thorough and consistent than I could have been without them. This goes for refactors too: Since the entire project is essentially mapped out in the spec.md file(s), it's hard for the agent to miss a relevant site in the code. Human review is key. Don't merge code you don't understand.
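To make the "relevant sites" idea concrete, here is a rough sketch of a check that compares the sites a spec lists for a pattern against the files that actually use it. The spec line format (`- pattern_name: path`), the marker string, and the function names are all assumptions for illustration; a real spec.md will look different.

```python
import re
from typing import Dict, Set, Tuple

# Assumed spec line format: "- auth_guard: src/orders/routes.py"
SITE_LINE = re.compile(r"^\s*-\s*(?P<pattern>[\w-]+):\s*(?P<path>\S+)\s*$")


def spec_sites(spec_text: str, pattern: str) -> Set[str]:
    """Collect the paths the spec lists as sites for one pattern."""
    sites = set()
    for line in spec_text.splitlines():
        match = SITE_LINE.match(line)
        if match and match.group("pattern") == pattern:
            sites.add(match.group("path"))
    return sites


def diff_sites(
    spec_text: str, pattern: str, marker: str, files: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Return (stale, undocumented) site paths.

    stale: listed in the spec but the file does not contain the marker.
    undocumented: file contains the marker but the spec does not list it.
    """
    documented = spec_sites(spec_text, pattern)
    actual = {path for path, source in files.items() if marker in source}
    return documented - actual, actual - documented


spec = """
- auth_guard: src/orders/routes.py
- auth_guard: src/users/routes.py
"""
files = {
    "src/orders/routes.py": "@require_auth\ndef list_orders(): ...",
    "src/users/routes.py": "def list_users(): ...",
    "src/billing/routes.py": "@require_auth\ndef charge(): ...",
}
stale, undocumented = diff_sites(spec, "auth_guard", "@require_auth", files)
```

In this toy data, `src/users/routes.py` is listed but unguarded and `src/billing/routes.py` is guarded but unlisted, which is the kind of drift a refactor can introduce.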

1 comments

jfaganel99 129 days ago

This is one of the most practical breakdowns I’ve seen for a while. The spec.md as a living architecture map is smart, and documenting auth guard pattern sites as new modules get added is exactly the kind of thing that prevents issues creeping in.

The bit I’d push on: do your reviewer agents catch logic errors, things like a double negative auth check or a race condition in a payment flow? Those usually pass a check because the code looks intentional and clean. Curious whether your reviewers are prompted specifically for security logic or more for spec conformance?
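As an illustration of the double negative case, here is a made-up example (not from any real codebase; `User`, `can_delete_buggy`, and `can_delete_fixed` are hypothetical names). The intent is "admins who are not banned", but the nested negation actually evaluates to "admin or banned".

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    is_admin: bool = False
    is_banned: bool = False


def can_delete_buggy(user: User) -> bool:
    # Reads like a careful guard, but by De Morgan's law
    # not (not A and not B) == A or B, so any banned user passes.
    return not (not user.is_admin and not user.is_banned)


def can_delete_fixed(user: User) -> bool:
    # States the intent directly: admin and not banned.
    return user.is_admin and not user.is_banned


banned_user = User("mallory", is_banned=True)
admin = User("alice", is_admin=True)
```

Both versions agree for `admin` and for an ordinary user, so a happy-path test passes either way. They only differ once a banned account is involved.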

“Don’t merge code you don’t understand” is the right closer. Most setups don’t force that discipline because people don’t have the knowledge :)


fhouser 129 days ago

Opus 4.6 usually doesn't disappoint. No double negative auth checks or race conditions to report, but I can say that introducing new functionality and patterns mostly requires a few cycles before the "repeatable pattern" is cleanly documented in the spec. When bugs do come up, the agent is quite good at finding the root cause and implementing a fix.


jfaganel99 129 days ago

I'm working on a model benchmark focused on which model is good at these tasks. Will keep you posted.


fhouser 129 days ago

Thanks, that would be great.


jfaganel99 124 days ago

As promised, here is the open-source Git repo so you can give it a go with your tooling: https://github.com/kolega-ai/Real-Vuln-Benchmark

Updated benchmark results published here also. BTW, with v002 we are consistently hitting 75+
