by fhouser 83 days ago
Opus 4.6 usually doesn't disappoint. No double-negative auth checks or race conditions to report, but I can say that introducing new functionality and patterns mostly requires a few cycles before the "repeatable pattern" is cleanly documented in the spec. When bugs do come up, the agent is quite good at finding the root cause and implementing a fix.

I'm working on a model benchmark focused on which model is good at these tasks. Will keep you posted.
Thanks, that would be great.
As promised, here is the open-source Git repo so you can give it a go with your tooling: https://github.com/kolega-ai/Real-Vuln-Benchmark

Updated benchmark results are also published here. BTW, with v002 we are consistently hitting 75+.