|
I hadn't taken a look, and indeed submitting bugs to the Firefox bug program looks much more accessible than upstreaming a patch into an open source project. It's true that just asking the robots to "find bugs" isn't enough, but it doesn't take a particularly sophisticated harness to make them work for simple targets. My tests primarily used a combination of Opus 4.5 and Gemini 3 Flash running in GitHub Copilot, and the harness was constructed by asking the agent something like: "Assume the perspective of an experienced software security engineer. Set this workspace up as a security bug and remediation factory, applying the principles of Lean Manufacturing (one piece flow, measure cycle time, minimize work in progress, etc). The workflow is to start with a repo URL and work methodically to produce: risk assessment based on existing commit history and advisories for the target and other similar projects, review the codebase for risk areas, set up tools including analysis and fuzzing to identify candidates, write PoC for each candidate, and a proposed fix based on the bug profile and the upstream's preferred contribution style. Rely as much as possible on existing tools, scripted automation in Python, and document templates." (and then a lot of back and forth to steer it to something reasonable). I took the first fix to finish (an OOB read in a heavily fuzzed open source library, missed by fuzzers because the post-underflow read happened to always hit a different valid data structure and so never triggered ASan) through to upstream remediation, which took, I think, six weeks to two months end to end. As you point out, the patch itself was functional, but IIRC the maintainer decided to make a slightly wider-scope change because it was a cleaner fix in their judgement, something that nobody outside the project would likely be able to figure out. 
Without making the tooling come up with both a PoC and a patch there is too much noise in the output, so even if the patch is not fully correct I think it's necessary. The actual back and forth of upstreaming was just very slow relative to the bug finding (no shade to the maintainers). Now Firefox sounds different, though the harness is probably much more complicated than one for testing a library in isolation. Copilot was wildly underpriced before the recent changes, so all of this fit in a normal $40 plan, but it probably would have been pretty expensive at metered Claude API prices. My tooling has been getting more sophisticated since this experiment. I'm working on a reverse engineering project now and trying to get the process to run hands-off, driven by Qwen 3.6 35B. If that works it might provide a way to find bugs on a reasonable budget. |
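
To make the "one piece flow, measure cycle time" part of that prompt concrete, here is a rough Python sketch of the shape such a pipeline could take. It is not the actual harness from the experiment: the stage names paraphrase the prompt, and `Candidate`, `run_one_piece_flow`, and `fake_stage` are hypothetical names invented for illustration. It also encodes the noise filter mentioned above, where a candidate only counts as signal if it ends up with both a PoC and a proposed fix.

```python
import time
from dataclasses import dataclass, field

# Stage names paraphrase the workflow in the prompt; they are illustrative only.
STAGES = ["risk_assessment", "code_review", "tooling_and_fuzzing", "poc", "proposed_fix"]


@dataclass
class Candidate:
    name: str
    artifacts: dict = field(default_factory=dict)
    started: float = 0.0
    finished: float = 0.0

    @property
    def cycle_time(self):
        # Time from picking the candidate up to finishing or dropping it.
        return self.finished - self.started

    @property
    def reportable(self):
        # Treat a candidate as signal only if it has both a PoC and a patch.
        return bool(self.artifacts.get("poc")) and bool(self.artifacts.get("proposed_fix"))


def run_one_piece_flow(names, run_stage, clock=time.monotonic):
    """Process candidates strictly one at a time (work-in-progress limit of 1)."""
    done = []
    for name in names:
        candidate = Candidate(name=name, started=clock())
        for stage in STAGES:
            result = run_stage(stage, candidate)
            if result is None:
                # Stage produced nothing usable: stop working on this candidate.
                break
            candidate.artifacts[stage] = result
        candidate.finished = clock()
        done.append(candidate)
    return done


def fake_stage(stage, candidate):
    # Hypothetical stand-in for an agent or tool invocation.
    if candidate.name == "cand-2" and stage == "poc":
        return None  # pretend no working PoC could be produced
    return f"{stage} output for {candidate.name}"


results = run_one_piece_flow(["cand-1", "cand-2"], fake_stage)
for c in results:
    status = "reportable" if c.reportable else "dropped"
    print(c.name, status, f"{c.cycle_time:.6f}s")
```

In a real setup `run_stage` would shell out to fuzzers, analyzers, or an agent, and the per-candidate cycle times would be the number to watch when tuning the process.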