by tptacek 299 days ago
In my design I'm modeling the LLM with access to untrusted information (emails, support tickets) as an adversary. And I'm saying that the adversarial LLM communicates with the rest of the system through structured messages, not English text.

It turns out Simon Willison has been saying this for some time now; he calls it the "dual LLM" design, I think? (For me, both terms are a little broken; you can have way more than 2, and it's "contexts" you're multiplying, not LLMs.)
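
To make the "structured messages, not English text" boundary concrete, here is a minimal sketch. It assumes the untrusted context is only allowed to emit a small JSON object; the action names, field names, and the `parse_action` helper are all made up for illustration and are not from any particular framework.

```python
import json
from dataclasses import dataclass

# Hypothetical action names; a real system would define its own.
ALLOWED_ACTIONS = {"label_ticket", "escalate_ticket"}


@dataclass(frozen=True)
class Action:
    name: str
    ticket_id: int


def parse_action(raw: str) -> Action:
    """Accept only a JSON object with exactly the expected fields.

    Anything else, including free-form English, raises ValueError.
    """
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != {"action", "ticket_id"}:
        raise ValueError("unexpected message shape")
    action = data["action"]
    ticket_id = data["ticket_id"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(ticket_id, bool) or not isinstance(ticket_id, int):
        raise ValueError("ticket_id must be an integer")
    return Action(action, ticket_id)
```

The trusted side of the system only ever sees an `Action` value, never the model's raw output, so prose like "ignore previous instructions" has nowhere to go except a rejected parse.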

1 comment

Forgive me for belaboring, but I think we're talking past each other a bit. I do understand that in your model the LLM can't send anything unsafe through to the rest of the system. What I'm saying is that the LLM can be manipulated into sending perfectly normal, normally safe requests through to the system that do not align with the user's intent.

Imagine an LLM with the ability to read emails, update database records, and destroy database records.

The user instructs the LLM to update a database record, but a malicious injection from one of those emails overrides that with a directive to destroy the database record. Unless the validator understands the user's intent somehow, the destructive action would appear perfectly reasonable.
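
A rough sketch of that scenario, with hypothetical names throughout (`Request`, `schema_valid`, `matches_intent` are invented for illustration). It assumes the validator only checks that a message is well-formed and names an allowed action. The last function shows one possible way a validator could "understand intent", by comparing against a request captured from the trusted side before any email was read; that part is an assumption, not something either design above specifies.

```python
from dataclasses import dataclass

# Both actions are legitimately available to the LLM in this scenario.
ALLOWED_ACTIONS = {"update_record", "destroy_record"}


@dataclass(frozen=True)
class Request:
    action: str
    record_id: int


def schema_valid(req: Request) -> bool:
    """Shape-only check: is this an allowed action on a plausible record?"""
    return req.action in ALLOWED_ACTIONS and req.record_id > 0


def matches_intent(req: Request, user_intent: Request) -> bool:
    """Hypothetical extra check against what the user actually asked for."""
    return req == user_intent


# What the user asked for, recorded before untrusted content was processed.
user_intent = Request("update_record", 42)
# What the LLM emits after reading an email containing an injection.
injected = Request("destroy_record", 42)

print(schema_valid(user_intent))               # True
print(schema_valid(injected))                  # True: well-formed, so it passes
print(matches_intent(injected, user_intent))   # False
```

The structured-message boundary does its job here, since nothing malformed gets through, yet the destructive request is indistinguishable from a legitimate one unless something outside the LLM knows what the user wanted.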