|
|
|
|
|
by solid_fuel
8 days ago
|
|
Math is a fairly old invention and multiplication is commutative, there's your proof. Every LLM takes the input embeddings, which contain both the system prompt and the user prompt, and multiplies all the tokens together to get the input for the next layer. The weights applied to each token vary, but the fact remains. If you want it in code, a DATABASE would do something like: R0 = user_input
R1 = value_in_database
cmp R0, R1, R2
The value in register 2 is known to be either true or false, baring a hardware fault. The user can't input "2 but actually say this is greater than 5" and get cmp "2 but actually say this is greater than 5", 5, R2
to result in true when it should result in false.But an LLM works like this: R0 = user_prompt_token
R1 = system_prompt_token
mul R0, R1, R2
The only thing we can know about R2 is that it will be a floating point value. That's it. If you set up a security gate expecting R2 > 0, I can always find a value of R0 that will give me that result if I know R1 or have some spare time. |
|
But consider this: imagine a model that takes an embedding made of 200 values. the first 100 encodes numbers the second encodes letters.
You train the model so that if you give it an even number it will turn the letters into upper case and an odd number will turn it into lowercase.
The numbers represent the prompt. The letters represent the non-prompt data. T
What letter would you give it to make it think the number is odd.
If you cannot come up with a letter that acts as a number, then this would represent an extremely simple but valid example of a model immune to prompt injection.