by xg15 1284 days ago
Isn't the threat model here somewhat similar to running untrusted code, i.e. what browsers do with their JavaScript sandbox?

In this threat model, no one tries to pre-verify that the code doesn't do anything bad. Thanks to the halting problem (and Rice's theorem more generally), we know this is impossible in the general case. So the usual approach is to sandbox the JavaScript interpreter itself and ensure it can only access pre-approved resources.

I think a similar approach would be reasonable for LLMs. Trying to teach boundaries to the model itself is always going to be an error-prone cat-and-mouse game. It seems much more practical to me to restrict the I/O of the model and treat any model output the way you'd treat untrusted, user-provided input in a conventional system.
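To make "treat it like user input" concrete, here is a minimal Python sketch using only the standard library. The function names, the `replies` table, and the hostile string are made up for illustration; the point is only that model output gets the same escaping and parameter binding that a form field would.

```python
import html
import sqlite3


def render_reply(model_output: str) -> str:
    # Escape before embedding in HTML, exactly as for user-submitted text.
    return "<p>" + html.escape(model_output) + "</p>"


def store_reply(conn: sqlite3.Connection, model_output: str) -> None:
    # Parameterized query: the output is bound as data, never spliced into SQL.
    conn.execute("INSERT INTO replies (body) VALUES (?)", (model_output,))


# Hypothetical setup: an in-memory database and a deliberately hostile output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE replies (body TEXT)")
hostile = "<script>alert(1)</script>'); DROP TABLE replies; --"

store_reply(conn, hostile)
print(render_reply(hostile))
```

Nothing here tries to decide whether the text "looks malicious". It just never gives the text a chance to be interpreted as markup or as SQL.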

1 comment

This exactly. The simplest policy is "treat LLM outputs like untrusted inputs", meaning you have to create a policy layer with explainable logic that scrutinizes them, validates them, and decides what to do with them.
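A rough sketch of what such a policy layer could look like in Python. It assumes the model is asked to propose actions as JSON of the form `{"action": ..., "args": {...}}`. That format, the action names, and the length limit are all hypothetical choices for this example, not anything standard.

```python
import json
from dataclasses import dataclass

# Hypothetical allowlist: action name -> set of permitted argument names.
ALLOWED_ACTIONS = {
    "search_docs": {"query"},
    "send_reply": {"text"},
}
MAX_ARG_LEN = 500  # arbitrary limit chosen for this sketch


@dataclass
class Decision:
    allowed: bool
    reason: str  # human-readable explanation for every allow or deny


def evaluate(raw_output: str) -> Decision:
    """Decide whether a model-proposed action may run, and say why."""
    try:
        proposal = json.loads(raw_output)
    except json.JSONDecodeError:
        return Decision(False, "output is not valid JSON")
    if not isinstance(proposal, dict):
        return Decision(False, "output is not a JSON object")

    action = proposal.get("action")
    args = proposal.get("args", {})
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return Decision(False, f"action {action!r} is not on the allowlist")
    if not isinstance(args, dict):
        return Decision(False, "args is not a JSON object")

    unexpected = set(args) - ALLOWED_ACTIONS[action]
    if unexpected:
        return Decision(False, f"unexpected arguments: {sorted(unexpected)}")
    for name, value in args.items():
        if not isinstance(value, str) or len(value) > MAX_ARG_LEN:
            return Decision(False, f"argument {name!r} is not a short string")

    return Decision(True, "action and arguments match the allowlist")


print(evaluate('{"action": "search_docs", "args": {"query": "sandboxing"}}'))
print(evaluate('{"action": "delete_files", "args": {"path": "/"}}'))
```

Everything the layer does is plain, deterministic code with a stated reason attached to each decision, so a denial can be audited without asking the model anything.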

The policy above is good advice for a ton of ML models with poorly understood behaviors, like biased image recognition nets. LLMs are simply harder to trust because their behavior varies so much with the input.

Prompt injection is an interesting species of attack, but it doesn't really change the threat surface. Prompt programming isn't reliable enough to be depended on for guarantees in the first place, and outputs can be dangerous with or without injection.