Hacker News new | ask | show | jobs
by ildari 241 days ago
The idea is that quarantined LLM has access to untrusted data, but doesn't have access to any tools or sensitive data.

The main LLM does have access to the tools or sensitive data, but doesn't have direct access to untrusted data (quarantine LLM is restricted at the controller level to respond only with integer digits, and only to legitimate questions from the main llm)

1 comments

Then I don't think I understand your full setup.

In the example case, without having access to the issue text (the evil data), how does the main LLM actually figure out what to do if the quarantined LLM can just answer with digits?

Sure it can discover that it's a request to update the documentation, but how does it get the information it needs to actually change the erroneous part of the documentation?

This is a topic I haven't addressed in the article. There are two answer types: "guessable" (discussed here) and unguessable (such as unique IDs, emails, etc.). For the second case, the main LLM can request a quarantined LLM to store the result at the controller level and only return a reference to this data. This data is then exposed only at the end of the AI agent's execution to prevent influencing its actions.