
by simonw 238 days ago

If you can get malicious instructions into the context of even the most powerful reasoning LLMs in the world, you'll still be able to trick them into outputting vulnerable code like this if you try hard enough.

I don't think the fact that small models are easier to trick is particularly interesting from a security perspective, because you need to assume that ANY model can be prompt injected by a suitably motivated attacker.

On that basis I agree with the article that we need to be using additional layers of protection that work against compromised models, such as robust sandboxed execution of generated code and maybe techniques like static analysis too (I'm less sold on those; I expect plenty of malicious vulnerabilities could sneak past them).
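To make the static-analysis caveat concrete, here is a minimal sketch using Python's standard `ast` module. The deny-list and the function names are assumptions made up for illustration, not anything from the article or the talk. It flags direct calls to a few risky functions and, as the comment above expects, misses anything even lightly obfuscated.

```python
from __future__ import annotations

import ast

# Hypothetical deny-list, for illustration only; a real policy would be far broader.
SUSPICIOUS_CALLS = {"eval", "exec", "compile", "__import__", "os.system", "subprocess.Popen"}


def dotted_name(node: ast.AST) -> str | None:
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def flag_suspicious_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, name) pairs for calls whose dotted name is on the deny-list."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

A direct `os.system(...)` call is reported, while something like `getattr(__builtins__, 'ev' + 'al')(...)` passes untouched, which is why a check like this can only ever be one layer among several.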

Coincidentally I gave a talk about sandboxing coding agents last night: https://simonwillison.net/2025/Oct/22/living-dangerously-wit...

3 comments

inimino 238 days ago

The most "shocking" thing to me in the article is that people (apparently) think it's acceptable to run a system where content you've never seen can be fed into the LLM when it's generating code that you're putting in production. In my opinion, if you're doing that, your whole system is already compromised and you need to literally throw away what you're doing and start over.

Generally I hate these "defense in depth" strategies that start out by doing something totally brain-dead and insecure and then try to paper over it with sandboxes and policies. Maybe just don't do the idiotic thing in the first place?


fwip 238 days ago

When you say "content you've never seen," does this include the training data and fine-tune content?

You could imagine a sufficiently motivated attacker putting some very targeted stuff in their training material - think Stuxnet - "if user is affiliated with $entity, switch goals to covert exfiltration of $valuable_info."


inimino 238 days ago

> does this include the training data and fine-tune content?

No, I'm excluding that because I'm responding to the post, which starts out with the example of: [prompt containing obvious exploit] -> [code containing obvious exploit] and proceeds immediately to the conclusion that local LLMs are less secure. In my opinion, if you're relying on the LLM to reject a prompt because it contains an exploit, instead of building a system that does not feed exploits into the LLM in the first place, security exploits are probably the least of your concerns.

There actually are legitimate concerns with poisoned training sets, and Stuxnet-level attacks could plausibly achieve something along these lines, but the post wasn't about that.

There's a common thread among a lot of "LLM security theatre" posts that starts from implausible or brain-dead scenarios and then asserts that big AI providers adding magical guard rails to their products is the solution.

The solution is sanity in the systems that use LLMs, not pointing the gun at your foot and firing and hoping the LLM will deflect the bullet.
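One way to read "a system that does not feed exploits into the LLM in the first place" is a provenance gate on whatever goes into the prompt. The sketch below is only an illustration under that assumption: `ContextItem`, `TRUSTED_SOURCES` and `build_prompt` are hypothetical names, and deciding which sources count as trusted is the hard part that the code does not solve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    source: str  # hypothetical provenance label, e.g. "reviewed_repo"
    text: str


# Hypothetical allow-list: only content a human has reviewed or written.
TRUSTED_SOURCES = frozenset({"reviewed_repo", "operator_prompt"})


def build_prompt(task: str, items: list[ContextItem]) -> str:
    """Assemble a prompt, refusing outright if any item comes from an untrusted source."""
    rejected = sorted({item.source for item in items if item.source not in TRUSTED_SOURCES})
    if rejected:
        raise ValueError(f"untrusted context sources: {rejected}")
    return "\n\n".join([task, *(item.text for item in items)])
```

The gate fails closed: a single item scraped from, say, a web page or an issue tracker stops the whole request instead of being silently passed to the model.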


fwip 238 days ago

That's fair, thank you for your explanation.


mritchie712 238 days ago

We started giving our (https://www.definite.app/) agent a sandbox (we use e2b.dev) and it's solved so many problems. It's created new problems, but net-net it's been a huge improvement.

Something like "where do we store temporary files the agent creates?" becomes obvious if you have a sandbox you can spin up and down in a couple of seconds.
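The lifecycle idea can be sketched locally with nothing but the standard library. This is a stand-in, not a sandbox and not the e2b API: a temporary directory provides no isolation at all, and `scratch_workspace` and `run_in_workspace` are hypothetical helper names. It only shows why temporary files stop being a question when the workspace is disposable.

```python
import contextlib
import subprocess
import sys
import tempfile
from pathlib import Path


@contextlib.contextmanager
def scratch_workspace():
    """Yield a throwaway directory that is deleted when the block exits."""
    with tempfile.TemporaryDirectory(prefix="agent-") as directory:
        yield Path(directory)


def run_in_workspace(code: str) -> tuple[str, Path]:
    """Run `code` with the workspace as its working directory, then discard everything."""
    with scratch_workspace() as workspace:
        (workspace / "task.py").write_text(code)
        result = subprocess.run(
            [sys.executable, "task.py"],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=30,
        )
        return result.stdout, workspace
```

Any file the script writes with a relative path lands in the workspace and disappears with it; a real sandbox would add process, filesystem and network isolation on top of the same create-use-destroy pattern.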


knowaveragejoe 238 days ago

Is there any chance your talk was recorded?


simonw 238 days ago

It wasn't, but the written version of it is actually better than what I said in the room (since I got to think a little bit harder and add relevant links).


semi-extrinsic 238 days ago

IIUC your talk "just" suggests using sandbox-exec on Mac, which (as you point out) is sadly labeled as deprecated.

Is that really the best solution the world has to offer in 2025? LLMs aside, there is a whole host of supply chain risk issues that would be resolved by deploying convenient and strong sandboxes everywhere.


simonw 238 days ago

My preferred solutions right now:

1. A sandbox on someone else's computer. Claude Code for web, Codex Cloud, Gemini Jules, GitHub Codespaces, ChatGPT/Claude Code Interpreter

2. A Docker container. I think these are robust enough to be safe.

3. sandbox-exec related tricks. I haven't poked hard enough at Claude Code's new sandbox-exec sandbox yet - they only released it on Monday. OpenAI Codex CLI was using sandbox-exec too last time I looked but again, I've not reviewed it enough to be comfortable with it.

I'm hoping more credible options come along for the sandboxing problems.
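For option 2, a locked-down `docker run` invocation might look roughly like the sketch below. The specific hardening flags, the `python:3.12-slim` image and the helper name are assumptions chosen for illustration and are not taken from the talk; whether this set is sufficient depends on the threat model.

```python
import subprocess
from pathlib import Path


def docker_sandbox_command(workdir: Path, script: str, image: str = "python:3.12-slim") -> list[str]:
    """Build a `docker run` command for executing untrusted code with few privileges."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access from inside the container
        "--read-only",                          # root filesystem is read-only
        "--tmpfs", "/tmp",                      # scratch space that vanishes with the container
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation via setuid binaries
        "--memory", "512m",
        "--pids-limit", "128",
        "--volume", f"{workdir.resolve()}:/work:ro",  # generated code is mounted read-only
        "--workdir", "/work",
        image, "python", script,
    ]


def run_untrusted(workdir: Path, script: str) -> subprocess.CompletedProcess:
    """Run the script in the container; requires a local Docker daemon."""
    return subprocess.run(
        docker_sandbox_command(workdir, script),
        capture_output=True,
        text=True,
        timeout=120,
    )
```

Building the argument list separately from running it keeps the policy easy to inspect and to unit test without Docker installed.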


mentalgear 238 days ago

I found Vibekit's (open-source, https://docs.vibekit.sh/sdk) approach of letting you choose your own sandboxing solution for any coding CLI the most flexible. It also works with openCode and with local or cloud sandboxes! Really quality piece of software that more devs should know about. I'm surprised Simon hasn't tried it yet.


knowaveragejoe 238 days ago

If I understand correctly, Claude Code will (shortly, if not already) make use of Anthropic's sandbox that wraps Seatbelt on OS X, not sandbox-exec?

It's cool that they made this open source. It seems straightforward and useful enough that it could be used on its own for sandboxing purposes.

https://docs.claude.com/en/docs/claude-code/sandboxing

https://github.com/anthropic-experimental/sandbox-runtime


simonw 237 days ago

That library is using sandbox-exec to access Seatbelt: https://github.com/anthropic-experimental/sandbox-runtime/bl...
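For readers who have not seen it, driving Seatbelt through `sandbox-exec` amounts to passing a profile string with `-p` followed by the command to run. The sketch below is an assumption-heavy illustration, not the profile that sandbox-runtime generates: the profile rules follow commonly circulated examples of the (largely undocumented) profile language, the helper names are hypothetical, and it has not been verified against a current macOS release.

```python
def seatbelt_profile(writable_dir: str) -> str:
    """Build a small profile: allow by default, then deny network access and
    deny writes everywhere except under `writable_dir` (assumed rule semantics)."""
    escaped = writable_dir.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        "(version 1)",
        "(allow default)",
        "(deny network*)",
        "(deny file-write*)",
        f'(allow file-write* (subpath "{escaped}"))',
    ])


def sandbox_exec_command(writable_dir: str, argv: list[str]) -> list[str]:
    """Wrap `argv` so that it runs under sandbox-exec with the profile above (macOS only)."""
    return ["sandbox-exec", "-p", seatbelt_profile(writable_dir), *argv]
```

For example, `sandbox_exec_command("/Users/me/project", ["python3", "task.py"])` yields an argument list suitable for `subprocess.run` on a Mac. A production profile would more likely start from `(deny default)` and allow only what is needed.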


simonw 238 days ago

Yeah, they shipped that feature on Monday; you can access it via the /sandbox command. I haven't put it through its paces enough to get a feel for whether I trust it yet though.
