| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cheriot 114 days ago
	This is a general thing with agent orchestration. A good sandbox does something for your local environment, but nothing for remote machines/APIs. I can't say this loudly enough, "an LLM with untrusted input produces untrusted output (especially tool calls)." Tracking sources of untrusted input with LLMs will be much harder than traditional [SQL] injection. Read the logs of something exposed to a malicious user and you're toast.

3 comments

paxys 114 days ago

Given the "random" nature of language models even fully trusted input can produce untrusted output.

"Find emails that are okay to delete, and check with me before deleting them" can easily turn into "okay deleting all your emails", as so many examples posted online are showing.

I have found this myself with coding agents. I can put "don't auto commit any changes" in the readme, in model instructions files, at the start of every prompt, but as soon as the context window gets large enough the directive will be forgotten, and there's a high chance the agent will push the commit without my explicit permission.

link

ramoz 114 days ago

Information flow control is a solid mindset but operationally complex and doesn’t actually safeguard you from the main problem.

Put an openclaw like thing in your environment, and it’ll paperclip your business-critical database without any malicious intent involved.

link

tovej 114 days ago

Even an LLM with trusted input produces untrusted output.

link