| HN Mirror



	by fhouser 129 days ago
	Opus 4.6 usually doesn't disappoint. No double-negative auth checks or race conditions to report, but I can say that introducing new functionality and patterns mostly requires a few cycles before the "repeatable pattern" is cleanly documented in the spec. When bugs do come up, the agent is quite good at finding the root cause and implementing a fix.

1 comment

I'm working on a model benchmark focused on which model is good at these tasks. Will keep you posted.

Thanks, that would be great.

As promised, here is the open-source Git repo so you can give it a go with your tooling: https://github.com/kolega-ai/Real-Vuln-Benchmark

Updated benchmark results are also published here. BTW, with v002 we are consistently hitting 75+.