| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by skaul 291 days ago

> I haven't seen a single credible technique from anyone that can distinguish content from instructions

You specifically mean that it's ~impossible to distinguish between content and instructions ONCE it is fed to the model, right? I agree with that. I was talking about a prior step, at the browser level. At the point that the query is sent to the backend, the browser would be able to distinguish between web contents and user prompt. This is useful for checking user-alignment of the output of the reasoning model (keeping in mind that the moment you feed in untrusted text into a model all bets are off).

We're actively thinking and working on this, so will have more to announce soon, but this discussion is useful!

1 comments

simonw 291 days ago

Even if you know the source of the text before you feed it to the model you still need to solve the problem of how to send untrusted text from a user through a model without that untrusted text being able to trigger additional tool calls or actions.

The most credible pattern I've seen for that comes from the DeepMind CaMeL paper - I would love to see a browser agent that robustly implemented those ideas: https://simonwillison.net/2025/Apr/11/camel/

link

skaul 291 days ago

> Even if you know the source of the text before you feed it to the model you still need to solve the problem of how to send untrusted text from a user through a model without that untrusted text being able to trigger additional tool calls or actions.

We're exploring taking the action plan that a reasoning model (which sees both trusted and untrusted text) comes up with and passing it to a second model, which doesn't see the untrusted text and which then evaluates it.

> The most credible pattern I've seen for that comes from the DeepMind CaMeL paper

Yeah we're aware of the CaMeL paper and are looking into it, but it's definitely challenging from an implementation pov.

Also, I see that we said "The browser should clearly separate the user’s instructions from the website’s contents when sending them as context to the model" in the blog post. That should have been "backend", not "model". Agreed that once you feed both trusted and untrusted tokens into the LLM the output must be considered unsafe.

link

jrflowers 291 days ago

>We're exploring taking the action plan that a reasoning model (which sees both trusted and untrusted text) comes up with and passing it to a second model, which doesn't see the untrusted text and which then evaluates it.

How is this different from the Dual-LLM pattern that’s described in the link that was posted? It immediately describes how that setup is still susceptible to prompt injection.

>With the Dual LLM pattern the P-LLM delegates the task of finding Bob’s email address to the Q-LLM—but the Q-LLM is still exposed to potentially malicious instructions.

link