Hacker News new | ask | show | jobs
by mtlynch 76 days ago
> What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.

Source? I haven't seen this anywhere.

In my experience, false positive rate on vulnerabilities with Claude Opus 4.6 is well below 20%.

5 comments

To the issue of AI submitted patches being more of a burden than a boon, many projects have decided to stop accepting AI-generated solutioning:

https://blog.devgenius.io/open-source-projects-are-now-banni...

These are just a few examples. There are more that google can supply.

According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently significantly reversed, at least form the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Steinberg, before that broader transition, also found the reports generated by LLM-powered but more sophisticated vuln research tools useful[2] and the guy who actually ran those tools found "They have low false positive rates."[3]

Additionally, there was no mention in the talk by the guy who found the vuln discussed in the TFA of what the false positive rate was, or that he had to sift through the reports because it was mostly slop — or whether he was doing it out of courtesy. Additionally, he said he found only several hundred, iirc, not "thousands." All he said was:

"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)

He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.

[0]: https://lwn.net/Articles/1065620/ [1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_... [2]: https://simonwillison.net/2025/Oct/2/curl/p [3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...

No, they haven't. Read the ai slop you posted carefully.

It's a policy update that enables maintainers to ignore low effort "contributions" that come from untrusted people in order to reduce reviewing workload.

An Eternal September problem, kind of.

Didn't you just restate what the parent claimed?
No, that's not at all the same thing: ai-generated contributions from people with a track record for useful contributions are still accepted.
Right. AI submissions are so burdensome that they have had to refuse them from all except a small set of known contributors.

The fact that there’s a small carve out for a specific set of contributors in no way disputes what Supermancho claimed.

A powertool that needs discretion and good judgement to be used well is being restricted to people with a track record of displaying good judgement. I see nothing wrong here.

AI enables volume, which is a problem. But it is also a useful tool. Does it increase review burden? Yes. Is it excessively wasteful energy wise? Yes. Should we avoid it? Probably no. We have to be pragmatic, and learn to use the tools responsibly.

Yes, but technically no different than "good contributions from humans are still accepted, AI slop can fuck off".

Since the onus falls on those "people with a track record for useful contributions" to verify, design tastefully, test and ensure those contributions are good enough to submit - not on the AI they happen to be using.

If it fell on the AI they're using, then any random guy using the same AI would be accepted.

Same. Codex and Claude Code on the latest models are really good at finding bugs, and really good at fixing them in my experience. Much better than 50% in the latter case and much faster than I am.
Source: """AI is bad"""
In my experience, the issue has been likelihood of exploitation or issue severity. Claude gets it wrong almost all the time.

A threat model matters and some risks are accepted. Good luck convincing an LLM of that fact

In TFA:

   I have so many bugs in the Linux kernel that I can’t 
   report because I haven’t validated them yet… I’m not going 
   to send [the Linux kernel maintainers] potential slop, 
   but this means I now have several hundred crashes that they
   haven’t seen because I haven’t had time to check them.
    
    —Nicholas Carlini, speaking at [un]prompted 2026
Those aren't false positives; they're results he hasn't yet inspected.

I wrote a longer reply here: https://news.ycombinator.com/item?id=47638062

>Those aren't false positives; they're results he hasn't yet inspected.

It's not a XOR

The article quote was being given as the supposed source for "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out", so should substantiate that claim - which it doesn't.

If the claim was instead just "a good portion of the hundreds more potential bugs it found might be false positives", then sure.

Yes it is. They're not not false positives until they're reported and consume maintainer time.
False positives can be eliminated mechanistically by testing if they actually work, in a sufficiently isolated automated test apparatus.

The hard thing is reducing detected crashes to well-formulated test cases that help rather than hinder maintainers.

some of them certainly are…
The comment said "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.".

Please explain how a bug can both be unvalidated, and also have undergone a three month process to determine it is a false positive?