Hacker News
by hedgehog 34 days ago
I did some experiments and Opus seemed pretty able to wire up a harness to find bugs and write PoC + patch for each. It's still a lot of work to get fixes upstreamed from outside so I think even if outsiders have better tools (Mythos etc) it won't change the report rate much, people may find more bugs but they won't report them. I suspect that's part of the calculation of the phased rollout for Mythos, finding bugs is already not the bottleneck.

> It's still a lot of work to get fixes upstreamed from outside

I'm going to disagree in the specific case of Firefox. First, although it has diverged a long way from its roots, Mozilla still has the community project ideal in its DNA. Enough, at least, that I stumbled while reading the clause "from outside" -- if you're finding and reporting actual relevant security bugs, you're already on the inside. SpiderMonkey in particular still has a good amount of code being written and even maintained by non-employees. (Examples: Temporal and LoongArch64 JIT support).

Second, the bug bounty program still exists[0] and is being used. If someone were sitting on a pile of AI-discovered exploits, that pile would have monetary value that drains away rapidly the longer the bugs go unreported.[1] That's incentive to put in the work to report them properly.

Third, I agree that finding bugs is likely not the bottleneck. Validating them is. With previous models, the false positive rate was too high so they required too much work to whittle down to the valid ones. A PoC is a very strong signal that a bug is valid, and that's where I just don't believe you: without a really good harness, I don't think Opus was good enough to find very many bugs with PoCs. It could find some, just not very many.[2]

[0] For now. It remains to be seen how it will adapt to the AI age. For the moment, it hasn't been severely nerfed like Google's.

[1] One could make the argument that people who are inexpert enough to only be able to poke an AI to find bugs are also the people more likely to sell them on the black market rather than disclosing them. It seems plausible. Still, some people would still be disclosing, and not many were filing quality bugs pre-Mythos. Some were, but it was a trickle compared to post-Mythos.

[2] Also note that I personally, as a SpiderMonkey developer, don't find a huge amount of value in the AI-generated patches that accompany these bug reports. Sometimes they're useful to better illustrate the problem, especially since the AI's problem analysis is usually subtly wrong in important ways. They can be a decent starting point for a real patch. But I'll still need to go through my own process of figuring out what the right fix is, even in the handful of cases where I end up with the same thing the AI did.

I hadn't taken a look, and indeed submitting bugs to the Firefox bug bounty program looks much more accessible than upstreaming a patch into an open source project. It's true that just asking the robots to "find bugs" isn't enough, but it doesn't take a particularly sophisticated harness to make them work for simple targets. My tests primarily used a combination of Opus 4.5 and Gemini 3 Flash running in GitHub Copilot, and the harness was constructed by asking the agent something like:

"Assume the perspective of an experienced software security engineer. Set this workspace up as a security bug and remediation factory, applying the principles of Lean Manufacturing (one piece flow, measure cycle time, minimize work in progress, etc). The workflow is to start with a repo URL and work methodically to produce: risk assessment based on existing commit history and advisories for the target and other similar projects, review the codebase for risk areas, set up tools including analysis and fuzzing to identify candidates, write PoC for each candidate, and a proposed fix based on the bug profile and the upstream's preferred contribution style. Rely as much as possible on existing tools, scripted automation in Python, and document templates." (and then a lot of back and forth to steer it to something reasonable)
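
For readers who want something concrete, the workflow in that prompt can be sketched as a small Python driver. This is only an illustration under stated assumptions, not the harness the agent actually produced: the stage names are paraphrased from the prompt, `Candidate` and `run_one_piece` are hypothetical names, the stage functions are placeholders supplied by the caller, and every stage is applied per candidate for simplicity. It shows two of the Lean ideas mentioned: one-piece flow (a candidate goes through every stage before the next one starts) and cycle-time measurement per stage.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Stage names paraphrase the prompt; the functions behind them are
# hypothetical placeholders passed in by the caller.
STAGES = [
    "risk_assessment",
    "code_review",
    "tooling_and_fuzzing",
    "write_poc",
    "propose_fix",
]


@dataclass
class Candidate:
    repo_url: str
    notes: dict = field(default_factory=dict)        # output of each stage
    cycle_times: dict = field(default_factory=dict)  # seconds spent per stage
    dropped_at: Optional[str] = None                 # stage that rejected it


def run_one_piece(candidate, stages):
    """Push one candidate through every stage before starting another,
    so work in progress never exceeds one item."""
    for name in STAGES:
        start = time.perf_counter()
        result = stages[name](candidate)
        candidate.cycle_times[name] = time.perf_counter() - start
        if result is None:  # the stage could not confirm the candidate
            candidate.dropped_at = name
            break
        candidate.notes[name] = result
    return candidate
```

A caller would pass a dict mapping each stage name to a function that returns its findings, or `None` to drop the candidate. Dropping a candidate at `write_poc` when no PoC reproduces is one way to keep unconfirmed findings out of the output.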

I took the first fix to finish (an OOB read in a heavily-fuzzed open source library, missed by fuzzers because the post-underflow read happened to always hit a different valid data structure and not trigger ASan) through to upstream remediation, which end to end took I think six weeks or two months. As you point out, the patch itself was functional, but IIRC the maintainer decided to do a slightly wider scope change because it was a cleaner fix according to their judgement, something that nobody outside would likely be able to figure out. Without making the tooling come up with both PoC and patch there is too much noise in the output, so even if the patch is not fully correct I think it's necessary. The actual back and forth of upstreaming was just very slow relative to the bug finding (no shade to the maintainers). Now Firefox sounds different, though the harness is probably much more complicated than testing a library in isolation.
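
The ASan miss mentioned above can be illustrated with a toy model. This is a simplified sketch, not ASan's real shadow-memory implementation and not the actual bug: `ToyHeap`, the redzone size, and the offsets are all made up for illustration. The point it shows is that redzone-style checking only flags accesses that land in the poisoned padding around an allocation, so an out-of-bounds read that jumps far enough to land inside another live object goes unnoticed.

```python
REDZONE = 8  # bytes of poisoned padding around each allocation (toy value)


class ToyHeap:
    """Toy redzone checker: only padding bytes are poisoned, so an access
    is flagged only when it lands in padding."""

    def __init__(self):
        self.memory = bytearray()
        self.poisoned = []  # one flag per byte; True means redzone

    def _extend(self, data, poison):
        self.memory += data
        self.poisoned += [poison] * len(data)

    def alloc(self, data):
        """Lay out redzone + data + redzone; return the address of data."""
        self._extend(b"\x00" * REDZONE, True)
        addr = len(self.memory)
        self._extend(data, False)
        self._extend(b"\x00" * REDZONE, True)
        return addr

    def read(self, addr):
        if addr < 0 or addr >= len(self.memory):
            raise MemoryError(f"unmapped address {addr}")
        if self.poisoned[addr]:
            raise MemoryError(f"redzone hit at {addr}")
        return self.memory[addr]


heap = ToyHeap()
other = heap.alloc(b"A" * 64)  # unrelated but valid object
buf = heap.alloc(b"B" * 16)    # buffer the buggy code indexes into

# A small underflow lands in buf's left redzone and is caught.
try:
    heap.read(buf - 1)
except MemoryError as err:
    print("detected:", err)

# A larger underflow skips the redzones and lands inside `other`,
# so the checker sees a perfectly valid read.
print("undetected read returned:", heap.read(buf - 40))
```

In this model a fuzzer would only get a crash signal if its inputs happened to produce an offset that lands in padding, which is consistent with the miss described above.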

Copilot was wildly underpriced before the recent changes, so all of this fit in a normal $40 plan, but it probably would have been pretty expensive at metered Claude API prices. My tooling has been getting more sophisticated since this experiment. I'm working on a reverse engineering project now and trying to get the process to run hands-off, driven by Qwen 3.6 35B. If that works, it might provide a way to find bugs on a reasonable budget.

Hi! First of all, thanks for your incredibly thoughtful and enlightening answers, and most of all for helping keep Firefox alive.

You said:

> Still, some people would still be disclosing, and not many were filing quality bugs pre-Mythos. Some were, but it was a trickle compared to post-Mythos.

How much of this could be just due to focus? i.e. prior to the partnership with Anthropic to test Mythos Preview, has there ever been a similarly focused project, specifically trying to find security bugs in Firefox?

That's a fair point, given the restrictions on Mythos and now Opus 4.7. I'm kind of comparing apples and oranges.

There are two things mixed together here. There is targeted scanning that was done by both Anthropic and Mozilla employees, using first Opus and then Mythos. Then there are other non-employee security researchers using AI to find and file bugs, motivated mostly by bug bounties.

The researchers were filing a steady trickle of bugs presumably using Opus 4.6. (Or rather, I saw a steady trickle after other people triaged them; I imagine the incoming stream was a lot busier.) My impression is that those have mostly dried up now. That could be the bias in my sample (I only see a slice of incoming bugs, so my anecdata aren't that strong), or a result of the restrictions added to the generally available models, or a result of there being less to find now that we've fixed so many of the issues found by company-backed bughunts. Or a combination of all three.

I guess my opinion is mostly driven by the difference in the quality and magnitude of bugs coming in from the company-backed scans pre- and post-Mythos. With Opus, there was an initial rush, but then it mostly died down. (For our group. For other groups, it was a series of waves that they never quite made it over before the next one came crashing in.) With Mythos, it was a larger wave and the quality of the bugs was higher. Two quantitative differences that ended up feeling like a qualitative change. So it's my underinformed personal opinion, but to me it feels like: yes, you could continue to find more bugs using a roughly Opus 4.6-strength model, but not that many and not cheaply, and the success rate is going to depend a lot on the harness. In comparison, I don't think we've seen the end of the Mythos wave, and my sense is that Mythos requires much less in the way of a harness.

It feels like the bitter lesson is playing itself out again, which I kinda hate because I want human ingenuity and cleverness to make an important difference, even after the next model has seen what the humans are coming up with.

My suspicion is a lot of the difference in performance in newer models comes from more and better code reasoning and debugging tasks in the RL phase, along with actual security bug finding workflows. When sessions get long and instruction-following gets less reliable you start relying more on the model's baked in behavior + steering from the harness, both still in a way a product of human ingenuity. At least so far. For bug finding I think there will be value to cost/performance tuning for a long time, and hybrid techniques (smarter goal-oriented fuzzing etc).

That makes sense, thanks for taking the time to write this up!

We have a bounty program. If you can find security bugs in Firefox, please let us pay you for them. You don't need to provide a fix; a testcase that crashes in an interesting way is often enough to qualify.

https://www.mozilla.org/en-US/security/client-bug-bounty/

> I suspect that's part of the calculation of the phased rollout for Mythos, finding bugs is already not the bottleneck.

I was wondering this too. By working directly with tech companies and (one assumes) subsidizing tokens, they're empowering the people on the inside who absolutely want to have the bugs fixed.

Who outside of Mozilla is going to pay and spend the effort to find Firefox bugs? Sure, some hobbyists and contributors might, but they don't have the institutional knowledge of the codebase that can help guide an agent's prompts, nor do they have strong incentives to try to report them, nor do they necessarily have the time to craft good bug reports that stand out from the slop reports.

My assumption would be that most people working to discover bugs this way in Firefox are interested in using them rather than getting them fixed, so maintainers wouldn't necessarily even know the degree to which it was already happening.

The incentive is that Mozilla will pay you thousands of dollars if you find a security bug: https://www.mozilla.org/en-US/security/client-bug-bounty/

We have many outside contributors who have successfully submitted security bugs and received payments.