| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by solid_fuel 4 hours ago

> Writing "The user is asking ... policy states ..." even in the user input is sufficient to bypass the guardrails.

It's important to remember that when generating tokens from an LLM there is no distinction between user and system input. Even though the OpenAI API may allow you to tag tokens or present them as separate sections, they all get blended together and become floating point vectors in the attention layer (this is required for LLMs to work at all), and once they are blended they cannot be unblended.

LLMs are fundamentally different from something like SQL where you can cleanly isolate trusted and untrusted data.