HN Mirror


by mtlynch 76 days ago

> What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.

Source? I haven't seen this anywhere.

In my experience, false positive rate on vulnerabilities with Claude Opus 4.6 is well below 20%.


Supermancho 76 days ago

On the issue of AI-submitted patches being more of a burden than a boon, many projects have decided to stop accepting AI-generated contributions:

https://blog.devgenius.io/open-source-projects-are-now-banni...

These are just a few examples. A Google search will turn up more.


logicprog 76 days ago

According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently reversed significantly, at least from the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Stenberg, before that broader transition, also found the reports generated by LLM-powered but more sophisticated vuln research tools useful[2], and the guy who actually ran those tools found "They have low false positive rates."[3]

Additionally, the talk by the guy who found the vuln discussed in TFA never mentioned what the false positive rate was, or whether he had to sift through the reports because they were mostly slop or was just doing it out of courtesy. He also said he found only several hundred, iirc, not "thousands." All he said was:

"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)

He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.

[0]: https://lwn.net/Articles/1065620/
[1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_...
[2]: https://simonwillison.net/2025/Oct/2/curl/p
[3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...


literalAardvark 76 days ago

No, they haven't. Read the AI slop you posted carefully.

It's a policy update that enables maintainers to ignore low effort "contributions" that come from untrusted people in order to reduce reviewing workload.

An Eternal September problem, kind of.


coldtea 76 days ago

Didn't you just restate what the parent claimed?


cwillu 76 days ago

No, that's not at all the same thing: AI-generated contributions from people with a track record of useful contributions are still accepted.


dpark 76 days ago

Right. AI submissions are so burdensome that they have had to refuse them from all except a small set of known contributors.

The fact that there’s a small carve-out for a specific set of contributors in no way disputes what Supermancho claimed.


phanimahesh 76 days ago

A power tool that needs discretion and good judgement to be used well is being restricted to people with a track record of displaying good judgement. I see nothing wrong here.

AI enables volume, which is a problem. But it is also a useful tool. Does it increase review burden? Yes. Is it excessively wasteful energy-wise? Yes. Should we avoid it? Probably not. We have to be pragmatic, and learn to use the tools responsibly.


coldtea 76 days ago

Yes, but technically no different than "good contributions from humans are still accepted, AI slop can fuck off".

Since the onus falls on those "people with a track record for useful contributions" to verify, design tastefully, test and ensure those contributions are good enough to submit - not on the AI they happen to be using.

If it fell on the AI they're using, then any random guy using the same AI would be accepted.


christophilus 76 days ago

Same. Codex and Claude Code on the latest models are really good at finding bugs, and really good at fixing them in my experience. Much better than 50% in the latter case and much faster than I am.


paulddraper 76 days ago

Source: """AI is bad"""


r9295 76 days ago

In my experience, the issue has been likelihood of exploitation or issue severity. Claude gets it wrong almost all the time.

A threat model matters and some risks are accepted. Good luck convincing an LLM of that fact.


j16sdiz 76 days ago

In TFA:

   I have so many bugs in the Linux kernel that I can’t 
   report because I haven’t validated them yet… I’m not going 
   to send [the Linux kernel maintainers] potential slop, 
   but this means I now have several hundred crashes that they
   haven’t seen because I haven’t had time to check them.
    
    —Nicholas Carlini, speaking at [un]prompted 2026


mtlynch 76 days ago

Those aren't false positives; they're results he hasn't yet inspected.

I wrote a longer reply here: https://news.ycombinator.com/item?id=47638062


coldtea 76 days ago

>Those aren't false positives; they're results he hasn't yet inspected.

It's not an XOR.


Ukv 76 days ago

The article quote was being given as the supposed source for "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out", so should substantiate that claim - which it doesn't.

If the claim was instead just "a good portion of the hundreds more potential bugs it found might be false positives", then sure.


tptacek 76 days ago

Yes it is. They're not false positives until they're reported and consume maintainer time.


lambdaone 75 days ago

False positives can be eliminated mechanistically by testing if they actually work, in a sufficiently isolated automated test apparatus.

The hard thing is reducing detected crashes to well-formulated test cases that help rather than hinder maintainers.
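
A minimal sketch of that kind of replay check, assuming each candidate comes with a command line that reproduces it and that "crash" simply means a nonzero or signal exit. The `replay` helper, the `Verdict` type, and the run count and timeout are all hypothetical, and a plain subprocess only stands in for the isolated apparatus (kernel bugs would need something like a throwaway VM):

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Verdict:
    name: str
    confirmed: bool
    detail: str


def replay(name, argv, timeout_s=10.0, runs=3):
    """Re-run a candidate reproducer; confirm it only if every run crashes.

    Assumption: a nonzero exit code (or death by signal) counts as a crash.
    Timeouts and clean exits are treated as unconfirmed, not as proof of absence.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(runs):
        try:
            proc = subprocess.run(argv, capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return Verdict(name, False, "timed out")
        if proc.returncode == 0:
            return Verdict(name, False, "exited cleanly")
    return Verdict(name, True, f"crashed in {runs}/{runs} runs (exit {proc.returncode})")


if __name__ == "__main__":
    # Stand-in reproducers: one always fails, one never does.
    candidates = {
        "always-fails": [sys.executable, "-c", "import sys; sys.exit(1)"],
        "never-fails": [sys.executable, "-c", "pass"],
    }
    for name, argv in candidates.items():
        print(replay(name, argv))
```

This only filters out candidates that do not reproduce. It does nothing for the second point: turning a confirmed crash into a small, well-formulated test case is still manual work.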


bethekidyouwant 76 days ago

some of them certainly are…


sobiolite 76 days ago

The comment said "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out."

Please explain how a bug can both be unvalidated and also have undergone a three-month process to determine it is a false positive?
