|
|
|
|
|
by simonw
238 days ago
|
|
If you can get malicious instructions into the context of even the most powerful reasoning LLMs in the world you'll still be able to trick them into outputting vulnerable code like this if you try hard enough. I don't think the fact that small models are easier to trick is particularly interesting from a security perspective, because you need to assume that ANY model can be prompt injected by a suitably motivated attacker. On that basis I agree with the article that we need to be using additional layers of protection that work against compromised models, such as robust sandboxed execution of generated code and maybe techniques like static analysis too (I'm less sold on those, I expect plenty of malicious vulnerabilities could sneak past them.) Coincidentally I gave a talk about sandboxing coding agents last night: https://simonwillison.net/2025/Oct/22/living-dangerously-wit... |
|
Generally I hate these "defense in depth" strategies that start out with doing something totally brain-dead and insecure, and then trying to paper over it with sandboxes and policies. Maybe just don't do the idiotic thing in the first place?