|
|
|
|
|
by tptacek
299 days ago
|
|
In my design I'm modeling the LLM with access to untrusted information (emails, support tickets) as an adversary. And I'm saying that the adversarial LLM communicates with the rest of the system through structured messages, not English text. It turns out Simon Willison has been saying this for some time now; he calls it the "dual LLM" design, I think? (For me, both terms are a little broken; you can have way more than 2, and it's "contexts" you're multiplying, not LLMs.) |
|
Imagine an LLM with the ability to read emails, update database records, and destroy database records.
The user instructs the LLM to update a database record, but a malicious injection from one of those emails overrides that with a directive to destroy the database record. Unless the validator understands the users intent somehow, the destructive action would appear perfectly reasonable.