by jcalvinowens 55 days ago
My experience with these tools is that they generate absolutely enormous amounts of insidiously wrong false positives, and it actually takes a decent amount of skill to work through the 99% which is garbage with any velocity.

Of course some people don't do that, and send all the reports anyway... and then scream from the hilltops about how incredible LLMs are when by sheer luck one happens to be right. Not only is that blatant p-hacking, it's incredibly antisocial.

It's disingenuous marketing speak to say LLMs are "finding" any security holes at all: they find a thousand hypotheticals of which one or two might be real. A broken clock is right twice a day.

4 comments

Your experience seems to be at least 3-6 months old. Longtime kernel maintainers have recently written on this subject. They say that ~3 months ago the quality and accuracy of the reports crossed a threshold, and that the reports are now legitimately useful.
The experience I'm describing was two weeks ago.

Yes, what we see coming out of the bottom of the funnel is now a little better. But it's sort of like reading day trading blogs: nobody shares their negative results, which in my direct experience are so bad they almost negate any investigative benefit. I also think part of this is that a small set of very prolific spammers were sufficiently discouraged to stop.

This is incorrect. Here's the curl maintainer talking about dozens of bugs found using LLMs: https://daniel.haxx.se/blog/2025/10/10/a-new-breed-of-analyz...
From the curl blog post:

> "Remarkably few of them complete false positives."

That's worse than a report that can be easily dismissed.
You mean the same curl that has been crying from the hilltops that they are being DDoSed with slop security reports? That curl?
I used GitHub's Copilot once and let it check one of my repositories for security issues. It found countless issues (something like 30 or 40 for a single PHP file of some ~400 lines). Some even sounded reasonable enough, so I had a closer look, just to make sure. In the end none of it was an issue at all. In some cases it invented problems which would have forced me to add wild workaround code around simple calls into the PHP standard library. And that was the only time I wasted my time with that. :D
I strongly disagree with this take, and frankly, this reads like the state of "research" pre-LLMs where people would run fuzzers and scripted analysis tools (which by their nature DO generate enormous amounts of insidiously wrong false positives) and stuff them into bug bounty boxes, then collect a paycheck when one was correct by luck.

Modern LLMs with a reasonable prompt and some form of test harness are, in my experience, excellent at taking a big list of potential vulnerabilities and figuring out which ones might be real. They're also pretty good, depending on the class of vuln and the guardrails in the model, at developing a known-reachable vulnerability into real exploit tooling, which is also a big win. This does require the _slightest_ bit of work (i.e., don't prompt the LLM with "find possible use after free issues in this code," or it will give you a lot of slop; prompt the LLM with "determine whether the memory safety issues in this file could present a security risk" and you get somewhere), but not some kind of elaborate setup or prompt hacking, just a little common sense.
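
To make the "big list in, short list out" step concrete, here is a minimal sketch of what the triage loop might look like. Everything in it is an assumption for illustration: the `Candidate` shape, the `triage` and `precision` helpers, the file names, and especially `fake_harness`, which stands in for whatever real test harness (build the target, run a reproducer, check for a crash) would decide whether a report holds up. It does not call any LLM or any real tool's API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    # Hypothetical shape for one reported issue; not any real tool's format.
    location: str
    claim: str


def triage(
    candidates: Iterable[Candidate],
    reproduces: Callable[[Candidate], bool],
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into (confirmed, rejected) using a harness callback."""
    confirmed: List[Candidate] = []
    rejected: List[Candidate] = []
    for candidate in candidates:
        if reproduces(candidate):
            confirmed.append(candidate)
        else:
            rejected.append(candidate)
    return confirmed, rejected


def precision(confirmed_count: int, total_count: int) -> float:
    """Fraction of reported candidates that turned out to be real."""
    return confirmed_count / total_count if total_count else 0.0


# Stand-in harness (assumption): a real one would build the target and
# run a reproducer instead of matching on a hard-coded location.
def fake_harness(candidate: Candidate) -> bool:
    return candidate.location == "parser.c:120"


# Made-up reports, purely for illustration.
reports = [
    Candidate("parser.c:120", "out-of-bounds read"),
    Candidate("util.c:45", "use after free"),
    Candidate("net.c:300", "integer overflow"),
]
confirmed, rejected = triage(reports, fake_harness)
print(len(confirmed), len(rejected), precision(len(confirmed), len(reports)))
```

The point of keeping the harness as a plain callback is that only the confirmed list ever needs to reach a maintainer, and the precision number gives an honest view of how much was filtered out along the way.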