| HN Mirror

I'll start with a quote from gwern on LessWrong: "... a language model is a Turing-complete weird machine running programs written in natural language; when you do retrieval, you are not 'plugging updated facts into your AI', you are actually downloading random new unsigned blobs of code from the Internet (many written by adversaries) and casually executing them on your LM with full privileges. This does not end well."

The instructions are manipulating the LLM itself. Making it exfiltrate and collect data , fetch new instructions from an attacker etc. All the connected applications can be fine but it's basically turning your assistant into a compromised, attacker-controlled version of itself just because it looked at the wrong news article. From our GitHub:

We demonstrate the potentially brutal consequences of giving LLMs like ChatGPT interfaces to other applications. We propose newly enabled attack vectors and techniques and provide demonstrations of each in this repository:

    Remote control of chat LLMs

    Persistent compromise across sessions

    Spread injections to other LLMs

    Compromising LLMs with tiny multi-stage payloads

    Leaking/exfiltrating user data

    Automated Social Engineering

    Targeting code completion engines

All of these are completely new but unfortunately it seems more difficult to explain the impact to people than we had anticipated.