Hacker News
by nodeshiftcloud 641 days ago
We find the idea of fine-tuning an LLM to triage and fix insecure code intriguing. However, we have concerns about the limitations posed by the size of the training dataset. As @tptacek mentioned, relying on "hundreds of closed source projects" might not provide the diversity needed to effectively identify a wide range of vulnerabilities, especially in complex systems like the Linux kernel. Incorporating open-source projects could enrich the model's understanding and improve its accuracy. Additionally, benchmarking the model by having it try to rediscover known CVEs in open-source code seems like a practical way to assess its real-world effectiveness. Has anyone experimented with expanding the training data or testing the model against known vulnerabilities in open-source repositories?
1 comment

That's what we've done. Unfortunately, I realized the sentence reads weirdly. It's meant to say we use hundreds of repositories: closed-source projects we own + open-source projects that are vulnerable by design + other open-source projects. I've updated the language in the post.

By doing so, we've been able to capture a very wide range of vulnerabilities, namely web application vulnerabilities. We've done this across projects ranging from small to very large.