by jfaganel99 91 days ago
This is one of the most practical breakdowns I’ve seen in a while. Using spec.md as a living architecture map is smart, and documenting auth guard pattern sites as new modules get added is exactly the kind of thing that prevents issues from creeping in.

The bit I’d push on: do your reviewer agents catch logic errors, such as a double-negative auth check or a race condition in a payment flow? Those usually pass review because the code looks intentional and clean. Curious whether your reviewers are prompted specifically for security logic or more for spec conformance?
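
To make the double-negative case concrete, here is a minimal hypothetical sketch. The names `User`, `can_access_buggy`, and `can_access` are made up for illustration and are not from the post or any real codebase:

```python
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical user model, for illustration only.
    is_admin: bool = False
    is_banned: bool = False


def can_access_buggy(user: User) -> bool:
    # Intended rule: allow only admins who are not banned.
    # The nested negation looks deliberate, but it only denies
    # banned non-admins; everyone else gets through.
    return not (not user.is_admin and user.is_banned)


def can_access(user: User) -> bool:
    # The same rule stated positively, which is much easier to review.
    return user.is_admin and not user.is_banned
```

A reviewer checking only for spec conformance could plausibly wave the first version through, since it is short, typed, and commented.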

“Don’t merge code you don’t understand” is the right closer. Most setups don’t force that discipline because people don’t have the knowledge :)


Opus 4.6 usually doesn't disappoint. No double-negative auth checks or race conditions to report on, but I can say that introducing new functionality and patterns usually takes a few cycles before the "repeatable pattern" is cleanly documented in the spec. When bugs do come up, the agent is quite good at finding the root cause and implementing a fix.
Working on a model benchmark focused on which model is good for these tasks. Will keep you posted.
Thanks, that would be great.
As promised, here is the open-source Git repo so you can give it a go with your tooling: https://github.com/kolega-ai/Real-Vuln-Benchmark

Updated benchmark results are also published here. BTW, with v002 we are consistently hitting 75+.