| > It's still a lot of work to get fixes upstreamed from outside

I'm going to disagree in the specific case of Firefox.

First, although it has diverged a long way from its roots, Mozilla still has the community project ideal in its DNA. Enough, at least, that I stumbled while reading the clause "from outside" -- if you're finding and reporting actual relevant security bugs, you're already on the inside. SpiderMonkey in particular still has a good amount of code being written and even maintained by non-employees. (Examples: Temporal and LoongArch64 JIT support.)

Second, the bug bounty program still exists[0] and is being used. If someone were sitting on a pile of AI-discovered exploits, that pile would have monetary value that drains away rapidly the longer the exploits go unreported.[1] That's incentive to put in the work to report them properly.

Third, I agree that finding bugs is likely not the bottleneck. Validating them is. With previous models, the false positive rate was too high, so the results took too much work to whittle down to the valid ones. A PoC is a very strong signal that a bug is valid, and that's where I just don't believe you: without a really good harness, I don't think Opus was good enough to find very many bugs with PoCs. It could find some, just not very many.[2]

[0] For now. It remains to be seen how it will adapt to the AI age. For the moment, it hasn't been severely nerfed like Google's.

[1] One could make the argument that people who are inexpert enough to only be able to poke an AI to find bugs are also the people more likely to sell them on the black market rather than disclose them. It seems plausible. Even so, some people would be disclosing, and not many were filing quality bugs pre-Mythos. Some were, but it was a trickle compared to post-Mythos.

[2] Also note that I personally, as a SpiderMonkey developer, don't find a huge amount of value in the AI-generated patches that accompany these bug reports. Sometimes they're useful to better illustrate the problem, especially since the AI's problem analysis is usually subtly wrong in important ways. They can be a decent starting point for a real patch. But I'll still need to go through my own process of figuring out what the right fix is, even in the handful of cases where I end up with the same thing the AI did. |
"Assume the perspective of an experienced software security engineer. Set this workspace up as a security bug and remediation factory, applying the principles of Lean Manufacturing (one piece flow, measure cycle time, minimize work in progress, etc). The workflow is to start with a repo URL and work methodically to produce: risk assessment based on existing commit history and advisories for the target and other similar projects, review the codebase for risk areas, set up tools including analysis and fuzzing to identify candidates, write PoC for each candidate, and a proposed fix based on the bug profile and the upstream's preferred contribution style. Rely as much as possible on existing tools, scripted automation in Python, and document templates." (and then a lot of back and forth to steer it to something reasonable)
I took the first fix to finish (an OOB read in a heavily-fuzzed open source library, missed by fuzzers because the post-underflow read happened to always hit a different valid data structure and so never triggered ASan) through to upstream remediation, which end to end took, I think, six weeks or two months. As you point out, the patch itself was functional, but IIRC the maintainer decided to make a slightly wider-scope change because it was a cleaner fix in their judgement, something that nobody outside would likely be able to figure out. Without making the tooling come up with both a PoC and a patch there is too much noise in the output, so even if the patch is not fully correct I think it's necessary. The actual back and forth of upstreaming was just very slow relative to the bug finding (no shade to the maintainers). Now, Firefox sounds different, though the harness is probably much more complicated than testing a library in isolation.
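To illustrate why a read like that can slip past a sanitizer, here is a toy model in Python. It is not the actual bug and not how ASan is implemented; it only mimics the general idea that ASan flags accesses to poisoned redzones around allocations, so an out-of-bounds offset large enough to jump the redzone and land inside another live object goes unreported. `ToyHeap`, `REDZONE`, and the object sizes are all made up for the sketch.

```python
REDZONE = 16  # assumed redzone size for this toy model only


class ToyHeap:
    """Flat byte array with poisoned redzones around every allocation."""

    def __init__(self):
        self.memory = bytearray()
        self.poisoned = []  # one flag per byte of self.memory

    def _extend(self, data, poisoned):
        self.memory += data
        self.poisoned += [poisoned] * len(data)

    def alloc(self, data):
        """Store data between two redzones and return its base address."""
        self._extend(bytes(REDZONE), True)   # left redzone
        base = len(self.memory)
        self._extend(data, False)            # the live object itself
        self._extend(bytes(REDZONE), True)   # right redzone
        return base

    def read(self, addr):
        """Checked read: complains only when the byte is poisoned or unmapped."""
        if not 0 <= addr < len(self.memory) or self.poisoned[addr]:
            raise MemoryError(f"toy sanitizer: invalid read at {addr}")
        return self.memory[addr]


heap = ToyHeap()
neighbour = heap.alloc(b"A" * 64)  # some unrelated, still-live object
buf = heap.alloc(b"B" * 8)         # the buffer the buggy code indexes into

# A small underflow lands in the redzone and is caught.
try:
    heap.read(buf - 1)
    small_underflow_caught = False
except MemoryError:
    small_underflow_caught = True

# A larger underflow skips both redzones and reads the neighbour silently.
leaked_byte = heap.read(buf - 40)
```

In this model the read at `buf - 40` returns a byte of the neighbouring object without any report, which is the same shape of miss described above: the access is wrong, but every byte it touches belongs to some valid allocation.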
Copilot was wildly underpriced before the recent changes, so all of this fit in a normal $40 plan, but it probably would have been pretty expensive at metered Claude API prices. My tooling has been getting more sophisticated since this experiment. I'm working on a reverse engineering project now and trying to get the process to run hands-off, driven by Qwen 3.6 35B. If that works, it might provide a way to find bugs on a reasonable budget.
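For readers wondering what the "one piece flow, measure cycle time, minimize work in progress" part of the prompt might translate to, here is a minimal Python sketch. It is not the tooling described above: the stage names paraphrase the prompt, and `Candidate`, `run_one_piece_flow`, and the `stage_fn` callback are hypothetical names invented for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Stage names paraphrase the prompt; the real work behind each is not shown.
STAGES = ["risk_assessment", "code_review", "tooling_and_fuzzing",
          "poc", "proposed_fix"]


@dataclass
class Candidate:
    name: str
    started: float = field(default_factory=time.monotonic)
    finished: Optional[float] = None
    completed_stages: List[str] = field(default_factory=list)
    dropped_at: Optional[str] = None  # stage that rejected the candidate

    @property
    def cycle_time(self) -> Optional[float]:
        """Seconds from pulling the candidate to finishing or dropping it."""
        if self.finished is None:
            return None
        return self.finished - self.started


def run_one_piece_flow(names: List[str],
                       stage_fn: Callable[[Candidate, str], bool]) -> List[Candidate]:
    """Carry one candidate through every stage before pulling the next.

    Work in progress is therefore always exactly one candidate. A stage
    returning False (for example, no working PoC) drops the candidate, so
    unvalidated findings never reach the output as finished work.
    """
    results = []
    for name in names:
        cand = Candidate(name)
        for stage in STAGES:
            if not stage_fn(cand, stage):
                cand.dropped_at = stage
                break
            cand.completed_stages.append(stage)
        cand.finished = time.monotonic()
        results.append(cand)
    return results


# Hypothetical stage function: pretend the second candidate has no working PoC.
def fake_stage(cand: Candidate, stage: str) -> bool:
    return not (cand.name == "candidate-b" and stage == "poc")


results = run_one_piece_flow(["candidate-a", "candidate-b"], fake_stage)
```

Gating on the PoC stage reflects the point made above that findings without both a PoC and a patch are mostly noise; the per-candidate `cycle_time` is what you would track to see where the flow stalls.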