by nahsra 1002 days ago
My experience is that they're not good at this for most vulnerability classes, especially those that are tough to discover by classical methods. Have you had any experience using them for this?

Trivial vulnerabilities are easily discoverable, yes, but they are also trivially discoverable by standard automation available today. I've found GPT-4 to be shockingly bad at vulnerability analysis for all but the most popular vulnerability classes. My speculation is that there just isn't enough literature on the less common classes for it to have practical mastery of them.

Complex vulnerabilities are emergent phenomena of multiple events across a codebase and its dependencies, involving control flow and data flow, often with type information and other runtime data missing. Even Anthropic's 100K context windows won't come close to fitting it all, and if you stuff all the code into embeddings, the ability to reason across all that space will be poor.
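To make the context-window point concrete, here is a rough sketch of how one might check whether a source tree would even fit. The 4-characters-per-token ratio is only a common rule of thumb (real tokenizers vary), and the function names and file extensions are made up for illustration.

```python
import os

# Assumption: roughly 4 characters per token. Real tokenizers differ,
# so treat the result as an order-of-magnitude estimate only.
CHARS_PER_TOKEN = 4.0
CONTEXT_WINDOW = 100_000  # the 100K window mentioned above


def estimate_tokens(root, extensions=(".py", ".java", ".js")):
    """Walk a source tree and estimate its size in tokens."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(tuple(extensions)):
                path = os.path.join(dirpath, name)
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    total_chars += len(fh.read())
    return int(total_chars / CHARS_PER_TOKEN)


def fits_in_window(root, window=CONTEXT_WINDOW):
    """True if the estimated token count fits in a single context window."""
    return estimate_tokens(root) <= window
```

Point `fits_in_window` at a checkout that includes vendored dependencies and you can see how the estimate compares with the window. Even when the code does fit, that says nothing about how well the model reasons across it.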

You can train a model to ask very pointed questions about particular snippets, but wholesale LLM-based analysis to find vulnerabilities seems like it'll be extremely slow, expensive and inaccurate.

2 comments

I am surprised that you can even say "it's not very good." I would expect the sorts of prompts I'm imagining to trigger the "I'm afraid I can't do that, Dave" safeguards every time. I guess I'm just imagining using it as an advanced fuzzer, but the thing about using an LLM is that you can take the fuzzer code and ask the LLM to generate slight variations, feed those into a test, and ask the LLM to flag ones that look like they might have been an exploit. When it generates nonsense code, you just throw out those runs. On the other hand, hallucination feels like an advantage here, since it's going to do things you never would have thought to test.
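The loop described above could look something like this sketch. Nothing here calls a real model: `propose_variants`, `is_well_formed` and `run_target` are hypothetical callables you would supply yourself (the first one is where an LLM call would go), and the flagging step is reduced to "the target raised an exception".

```python
def triage_variants(seed, propose_variants, is_well_formed, run_target, rounds=2):
    """Generate variants of a seed input, run them, and collect suspicious ones.

    propose_variants(parent) -> list of candidate inputs (hypothetical LLM hook)
    is_well_formed(candidate) -> bool, used to throw out nonsense outputs
    run_target(candidate) -> runs the code under test; raising counts as a hit
    """
    corpus = [seed]
    flagged = []
    for _ in range(rounds):
        next_corpus = []
        for parent in corpus:
            for candidate in propose_variants(parent):
                if not is_well_formed(candidate):
                    continue  # nonsense output: discard this run
                try:
                    run_target(candidate)
                except Exception as exc:
                    # Crash or unexpected error: keep it for later analysis.
                    flagged.append((candidate, type(exc).__name__))
                else:
                    next_corpus.append(candidate)
        # Mutate the survivors next round; fall back to the old corpus if none.
        corpus = next_corpus or corpus
    return flagged
```

Every call to `propose_variants` is a full model inference, so the cost of this loop scales with rounds times corpus size.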
I don't have any problem getting it to help with exploit development. I never had any issue with that with any of them, in fact, which is surprising in retrospect.

Inference is so slow, and almost everything about fuzzers is meant to be super fast. Maybe there's a late-stage part in crash validation/analysis where you can use it, but my bias is that we're just not there yet.

I would not expect an LLM to be good at this without specialized training. I have tried prompting for code generation.

I do not know of LLMs that have been specifically trained on, say, the testing corpus of some "lint" programs and against known vulns. As you point out, a user of an LLM couldn't do the equivalent just by showing it some vulns, whereas it would be perfectly reasonable to get an LLM to write, for example, business case studies by showing it examples.

I don't think specialized training solves any of the problems I mentioned. It doesn't increase the window size, or provide any of the types of highly specialized and optimized multi-file, multi-technique analysis, or make it any cheaper.