| There are assumptions here that are intimately related to the meta questions mentioned. > Prompt injection or prompt attacks are well known and likely impossible to guard against. They are impossible to guard against under the assumption that the current LLM paradigm is all there is and all there could possibly be. There could be other realizations of AI. The latest impressive achievements are yielding an ongoing identification of the current computational approaches with human intelligence itself, with how we humans model reality by using natural language, and with how we could ever imagine that a computer can model reality via natural language. These are all strong assumptions and are very common even amongst researchers. > Can you really get a human to be invulnerable to manipulation? Most definitely not, but we are specifically not talking about humans, but about: > AI, which relies upon computers, algorithms, and numbers written on memory I am not making a case for machines being completely invulnerable to manipulation, which requires an analysis of the entailment structures of reality that would yield that complete invulnerability is impossible, but for better direct control over the internals, rather than relying on external instructions that easily undergo jailbreaks with simple prompt attacks. > Why would we expect the machines to be any better? One argument is: because they can be programmed and the memory they rely upon for their algorithms can be edited directly, with both accuracy and precision. The missing piece is how to model reality via natural language in a computer, in a way that we would know what to edit in order to affect the model with accuracy and precision. LLMs, currently, are non-editable. When interacting with ChatGPT, its answers A will be generated from chat history H, which includes prompts and guidelines, by an immutable function, or program, f: A = f(H). 
It is remarkable that, in LLMs, f cannot be edited and is never entailed by the individual chat H. Since we can have multiple exchanges in the same history, H will itself contain information (entailment) from f, but never the other way around: f is not entailed by A or H, it is fixed, and only entailed externally by the design and training steps. f can be fine-tuned, yet it will retain remnants of past training, hence it cannot truly be edited at will unless we retrain the whole model. Even then, control over f is neither accurate nor precise. It seems that non-editable LLMs remove some of the agency that is inherent in programming: editing the internals of a program to shape the entailment structures that we want to realize, with accuracy and precision. I am by no means indicating that editable AI models that can be steered are easy to achieve, rather that the very possibility thereof is rarely mentioned, and often implicitly assumed not to exist in absolute statements that in fact only strictly apply to the current mainstream approaches. |
This is only true for the first release. Consider that OpenAI has been actively collecting input/output pairs from their users, and then retraining and updating the model. Thus A and H have impacted ChatGPT. This in turn affects how people interact with the system.
You can certainly constrain f to a single point in time, but most people will not. They think of ChatGPT as f and that f is changing or evolving (in the non-literal sense). So depending on how you look at it, f is indeed editable. Opinions will differ here and there is no right answer.
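To make the two readings concrete, here is a toy sketch in Python. Everything in it is hypothetical and for illustration only: `Release`, `Service`, `make_f`, and `retrain` are made-up names, and the "model" is a trivial string function, not how ChatGPT actually works. It only shows the shape of the argument: within one release, f is fixed and A = f(H); across releases, the logged (H, A) pairs feed into the next f.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = Tuple[str, ...]  # H: the chat so far, as an immutable tuple of messages


def make_f(version: int, n_pairs: int) -> Callable[[History], str]:
    """Toy stand-in for training: returns a fixed function f for one release.

    Assumption: the 'model' just reports its version and how many logged
    pairs it was 'trained' on. A real model is nothing like this.
    """
    def f(history: History) -> str:
        return f"v{version}/{n_pairs} pairs: reply to {len(history)} message(s)"
    return f


@dataclass(frozen=True)
class Release:
    """One frozen release: within it, f never changes."""
    version: int
    f: Callable[[History], str]


@dataclass
class Service:
    """The thing most people call 'ChatGPT': a sequence of releases plus a log."""
    release: Release
    log: List[Tuple[History, str]] = field(default_factory=list)

    def chat(self, history: History, prompt: str) -> Tuple[History, str]:
        h = history + (prompt,)
        a = self.release.f(h)      # A = f(H); f is not modified by the chat
        self.log.append((h, a))    # but the (H, A) pair is collected
        return h + (a,), a         # the answer becomes part of the next H

    def retrain(self) -> None:
        # Hypothetical feedback loop: the next f depends on logged (H, A) pairs.
        next_version = self.release.version + 1
        self.release = Release(next_version, make_f(next_version, len(self.log)))
        self.log.clear()


service = Service(Release(1, make_f(1, 0)))
h, a1 = service.chat((), "hello")
h, a2 = service.chat(h, "again")   # same f, longer H
service.retrain()                  # new f, shaped by the logged chats
_, a3 = service.chat((), "hello")  # same prompt, different answer
```

If you pin `service.release`, f is immutable, which is the parent's view. If you look at `service` over time, f has changed because of past A and H, which is the view I am describing.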
Google wrote a good paper on this feedback loop almost 10 years ago called "Machine Learning: The High Interest Credit Card of Technical Debt" that is even more relevant today.
https://research.google/pubs/pub43146/