
Forgive me for belaboring the point, but I think we're talking past each other a bit. I do understand that in your model the LLM can't send anything unsafe through to the rest of the system. What I'm saying is that the LLM can be manipulated into sending perfectly normal, normally safe requests through to the system that do not align with the user's intent.

Imagine an LLM with the ability to read emails, update database records, and destroy database records.

The user instructs the LLM to update a database record, but a malicious injection from one of those emails overrides that with a directive to destroy the record instead. Unless the validator somehow understands the user's intent, the destructive action would appear perfectly reasonable.
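
To make that concrete, here's a rough Python sketch. Every name in it (`ToolCall`, `schema_validator`, `intent_aware_validator`, `user_granted`) is made up for illustration and isn't taken from your system. It also assumes the set of actions the user authorized can be captured somewhere outside the LLM's reach, which is the hard part in practice.

```python
from dataclasses import dataclass

# Hypothetical tool surface: the three capabilities from the example above.
ALLOWED_ACTIONS = {"read_email", "update_record", "delete_record"}


@dataclass(frozen=True)
class ToolCall:
    action: str
    record_id: int


def schema_validator(call: ToolCall) -> bool:
    """Accepts any well-formed call to an allowed action (no notion of intent)."""
    return call.action in ALLOWED_ACTIONS and call.record_id > 0


def intent_aware_validator(call: ToolCall, user_granted: frozenset) -> bool:
    """Also requires the action to be one the user's request authorized.

    Assumption: user_granted is built from the user's request by trusted code,
    not by the LLM that read the emails.
    """
    return schema_validator(call) and call.action in user_granted


# The user asked for an update; an injected email turned it into a delete.
user_granted = frozenset({"update_record"})
injected_call = ToolCall(action="delete_record", record_id=42)

schema_ok = schema_validator(injected_call)
intent_ok = intent_aware_validator(injected_call, user_granted)
print(schema_ok, intent_ok)
```

The delete is well-formed and on the allow-list, so the first check passes it. Only the second check, which knows what the user actually asked for, rejects it.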