by sfink 32 days ago
That's a fair point, given the restrictions on Mythos and now Opus 4.7. I'm kind of comparing apples and oranges.

There are two things mixed together here. There is targeted scanning that was done by both Anthropic and Mozilla employees, using first Opus and then Mythos. Then there are other non-employee security researchers using AI to find and file bugs, motivated mostly by bug bounties.

The researchers were filing a steady trickle of bugs presumably using Opus 4.6. (Or rather, I saw a steady trickle after other people triaged them; I imagine the incoming stream was a lot busier.) My impression is that those have mostly dried up now. That could be the bias in my sample (I only see a slice of incoming bugs, so my anecdata aren't that strong), or a result of the restrictions added to the generally available models, or a result of there being less to find now that we've fixed so many of the issues found by company-backed bughunts. Or a combination of all three.

I guess my opinion is mostly driven by the difference in the quality and magnitude of bugs coming in from the company-backed scans pre- and post-Mythos. With Opus, there was an initial rush, but then it mostly died down. (For our group. For other groups, it was a series of waves that they never quite made it over before the next one came crashing in.) With Mythos, it was a larger wave and the quality of the bugs was higher. Two quantitative differences that ended up feeling like a qualitative change. So it's my underinformed personal opinion, but to me it feels like: yes, you could continue to find more bugs using a roughly Opus 4.6-strength model, but not that many and not cheaply, and the success rate is going to depend a lot on the harness. In comparison, I don't think we've seen the end of the Mythos wave, and my sense is that Mythos requires much less in the way of a harness.

It feels like the bitter lesson is playing itself out again, which I kinda hate because I want human ingenuity and cleverness to make an important difference, even after the next model has seen what the humans are coming up with.

2 comments

My suspicion is that a lot of the performance difference in newer models comes from more and better code reasoning and debugging tasks in the RL phase, along with actual security bug-finding workflows. When sessions get long and instruction-following gets less reliable, you start relying more on the model's baked-in behavior plus steering from the harness, both of which are still, in a way, products of human ingenuity. At least so far. For bug finding, I think there will be value in cost/performance tuning for a long time, and in hybrid techniques (smarter goal-oriented fuzzing, etc.).
That makes sense, thanks for taking the time to write this up!