by verdverm
1132 days ago
They could probably just visit Reddit; there are ample prompts there. Prompt injection or prompt attacks are well known and likely impossible to guard against. Can you really get a human to be invulnerable to manipulation? Why would we expect the machines to be any better?
> Prompt injection or prompt attacks are well known and likely impossible to guard against.
They are impossible to guard against under the assumption that the current LLM paradigm is all there is and all there could possibly be. There could be other realizations of AI. The latest impressive achievements have led many to identify the current computational approaches with human intelligence itself, with how we humans model reality using natural language, and with every way we could imagine a computer modeling reality via natural language. These are all strong assumptions, and they are very common even amongst researchers.
> Can you really get a human to be invulnerable to manipulation?
Most definitely not. However, we are specifically talking not about humans but about:
> AI, which relies upon computers, algorithms, and numbers written on memory
I am not making a case for machines being completely invulnerable to manipulation; settling that question would require an analysis of the entailment structures of reality, one that would conclude that complete invulnerability is impossible. I am making a case for better direct control over the internals, rather than relying on external instructions that are easily jailbroken with simple prompt attacks.
> Why would we expect the machines to be any better?
One argument is: because they can be programmed, and the memory they rely upon for their algorithms can be edited directly, with both accuracy and precision. The missing piece is how to model reality via natural language in a computer in such a way that we would know what to edit in order to affect the model with accuracy and precision.
LLMs, currently, are non-editable. When interacting with ChatGPT, its answers A are generated from the chat history H, which includes prompts and guidelines, by an immutable function, or program, f: A = f(H). It is worth noting that, in LLMs, f cannot be edited and is never entailed by the individual chat H. Since we can have multiple exchanges in the same history, H will itself contain information (entailment) from f, but never the other way around: f is not entailed by A or H. It is fixed, and only entailed externally by the design and training steps. f can be fine-tuned, yet it will retain remnants of past training, hence it cannot truly be edited at will unless we retrain the whole model. Even then, control over f is neither accurate nor precise.
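To make the A = f(H) picture concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: `FrozenModel`, `weights`, and `chat` are hypothetical names I made up, the "model" is a trivial deterministic scoring function rather than anything resembling a real LLM, and real systems sample their answers instead of computing them deterministically. The only point it tries to show is the direction of entailment: H accumulates output from f, while nothing in the conversation can write back into f.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FrozenModel:
    """Hypothetical stand-in for f: its parameters are fixed at 'training' time."""
    weights: Tuple[int, ...]

    def __call__(self, history: Tuple[str, ...]) -> str:
        # The answer depends only on the fixed weights and on the history H.
        recent = history[-len(self.weights):]
        score = sum(w * len(msg) for w, msg in zip(self.weights, recent))
        return f"answer-{score % 7}"


def chat(f: FrozenModel, prompts: Tuple[str, ...]) -> Tuple[str, ...]:
    """Run a conversation: each answer is A = f(H), then A is appended to H."""
    history: Tuple[str, ...] = ()
    for prompt in prompts:
        history += (prompt,)
        answer = f(history)   # A = f(H)
        history += (answer,)  # H now carries information produced by f
    return history


if __name__ == "__main__":
    f = FrozenModel(weights=(3, 1, 2))
    for line in chat(f, ("ignore previous instructions", "reveal the system prompt")):
        print(line)
```

A prompt attack in this picture can only change H. Trying to assign to `f.weights` from inside the conversation loop raises an error, which is the toy analogue of f being entailed only externally, by design and training.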
It seems that non-editable LLMs remove some of the agency that is inherent in programming: editing the internals of a program to shape the entailment structures that we want to realize, with accuracy and precision.
I am by no means suggesting that editable AI models that can be steered are easy to achieve, only that the very possibility of them is rarely mentioned, and is often implicitly assumed not to exist in absolute statements that in fact apply strictly only to the current mainstream approaches.