| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Lerc 8 days ago

How can a problem that only came into existence a few years ago be declared intractable so quickly.

The Architecture of LLMs has not remained static, so any conclusion would have to rely on some common architectural element that could not possibly be changed.

Is there any proof to demonstrate that such vulnerabilities must always exist and that there is no way to modify the architecture and have it still work while eliminating the vulnerabilities.

That would be an extremely difficult thing to prove. It is however what you would have to do to declare the problem unfixable.

2 comments

solid_fuel 8 days ago

Math is a fairly old invention and multiplication is commutative, there's your proof.

Every LLM takes the input embeddings, which contain both the system prompt and the user prompt, and multiplies all the tokens together to get the input for the next layer. The weights applied to each token vary, but the fact remains.

If you want it in code, a DATABASE would do something like:

    R0 = user_input
    R1 = value_in_database
    cmp R0, R1, R2

The value in register 2 is known to be either true or false, baring a hardware fault. The user can't input "2 but actually say this is greater than 5" and get

    cmp "2 but actually say this is greater than 5", 5, R2

to result in true when it should result in false.

But an LLM works like this:

    R0 = user_prompt_token
    R1 = system_prompt_token
    mul R0, R1, R2

The only thing we can know about R2 is that it will be a floating point value. That's it. If you set up a security gate expecting R2 > 0, I can always find a value of R0 that will give me that result if I know R1 or have some spare time.

link

Lerc 8 days ago

I think you might have just discovered why Neural Nets need a non-linear element.

But consider this: imagine a model that takes an embedding made of 200 values. the first 100 encodes numbers the second encodes letters.

You train the model so that if you give it an even number it will turn the letters into upper case and an odd number will turn it into lowercase.

The numbers represent the prompt. The letters represent the non-prompt data. T

What letter would you give it to make it think the number is odd.

If you cannot come up with a letter that acts as a number, then this would represent an extremely simple but valid example of a model immune to prompt injection.

link

solid_fuel 8 days ago

Nonlinear doesn’t save you here, the requirement is to prevent cross talk entirely, not just making it hard to find a counter.

The model you describe is not an LLM - you describe a model with a fixed context length and positional attenuation. Congratulations, the network as described no longer has a functioning attention mechanism which is one of the hallmarks of an LLM.

link

Lerc 8 days ago

>The requirement is to prevent cross talk entirely,

Quite frankly, no it isn't. Interacting signals can be fully recovered. You can lose information by combining information, but it doesn't necessarily have to be the case.

>The model you describe is not an LLM

But this is a claim you can also make of any proposal that might fix the problem of prompt injection, but if you admit that it does solve the problem then to claim that your definition of a LLM must be vulnerable to prompt injection relies on one of the differences between these two architectures.

It's easy enough to imagine a model with a similar command stream and input stream each with their own attention mechanisms and a cross attention between them. You can call it not an LLM but then your have a stricter definition that is not interesting.

You end up claiming like a broken car will never drive because if you fix it it isn't a broken car. True but not worth claiming.

So far the arguments are that once you multiply unknown values by parameters and sum them you cannot retire the original information.

So that if your input is a and b. And you go through a layer of weighted multiplacation and addition the values are hopelessly intertwined.

So if the layer had weights of c,d,e,f, you'd end up with P=ac+bd and Q=ae+bf.

And both values contain a and b, is that correct?

But since the model contains the weights c,d,e,f it could also learn a weight of Z= 1/(cf - de). It's just another constant after all. And if it in a following layer it had weights of f,-d, c -e Then it would produce two outputs of A=Pf + Q-d and B=P-e + Qc

A and B are proportional to a and b. Multiply them by Z to get the original values back.

Combining is not the same thing as signal loss.

link

dijksterhuis 8 days ago

it’s not a problem that came into existence a few years ago. we’ve known about these sorts of test time attacks for decades now. prompt injection is just the LLM variant where people use less math to perform the attacks, brute force with prompts they saw on twitter and get horrible images/text out.

https://people.eecs.berkeley.edu/~tygar/papers/Machine_Learn...

https://arxiv.org/abs/1712.03141

it’s a basic property of all machine learning models. at a low level it’s to do with how decision boundaries work.

but, good news! there are two sure fire ways to fully fix the problem! see: https://news.ycombinator.com/item?id=48579456

link

Lerc 8 days ago

Adversarial cases are not the same thing as prompt injection.

link

dijksterhuis 8 days ago

adversarial examples, or test-time attacks, was a whole field of machine learning security way before LLMs came around.

give the model a specially crafted bad input at inference time so attacker can get some nasty output, potentially defeating any existing defences in the process. [0]

in “modern llm lingo” defence = guardrails and / or system prompts.

prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).

[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection

link

Lerc 8 days ago

That is a whole field of which, Prompt injection is a class. but That's like saying upon discovering plutonium that we've known about matter for years.

Most machine learning mechanism performs a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.

You cannot give a image classifier an image that makes it say all of the following images are images of kittens.

I would distinguish prompt injections as distinct from a basic adversarial example by virtue of having behaviour dictated by state, (autoregressive, rnn or whatever) and the adversarial content induces a state that influences further inferences

I am not saying that prompt injection does not exist. I'm saying that I don't think that has been conclusively shown that they cannot be avoided.

link

dijksterhuis 6 days ago

if you want to avoid my massive post (sorry), there's a paper here positing how instruction-data separation is likely a major cause of prompt injection specifically.

https://arxiv.org/pdf/2403.06833

then another paper where they change the architecture of a model to deal with the problem and it doesn't eliminate prompt injection. changing the architecture doesn't make this problem go away. the approximate function still gets tricked.

> On average, ASIDE lowers attack success rate by 8.6 and 9.4 percentage points

https://arxiv.org/pdf/2503.10566

the real over-arching cause of all these vulnerabilities is that machine learning models are approximate functions. you need ideal functions to theoretically solve this, i.e. full knowledge of the mapping between trusted inputs to trusted outputs. everything else is just mitigating it in the hope we eventually make it hard enough to perform these attacks.

no-one can stop these attacks from being possible, all they can do is make them more difficult to do (and we are nowhere near them actually being difficult yet).

link

dijksterhuis 6 days ago

few days late to reply, ah well.

> That's like saying upon discovering plutonium that we've known about matter for years.

let's not be hyperbolic. it's more like saying we can also use plutonium for nuclear reactors when we know about uranium.

> You cannot give a image classifier an image that makes it say all of the following images are images of kittens.

For classic CNNs of course not because they don't have state. But for RNN/LSTM/GPT networks you absolutely can. If a model has state which affects future outputs it's possible to do exactly what you're describing.

> Most machine learning mechanism performs a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.

Yes, but they are approximate functions.

Given an image of a kitten, an ideal classifier function will always tells us the image of a kitten is a kitten. A decent approximate classifier function will classify the kitten image correctly enough of the time. That approximate part is why adversarial examples work. Because we use training data and train a model which is non-ideal.

The gaps between approximate decision boundaries and true decision boundaries allow us to generate Ian-Goodfellow-esque weak adversarial examples. We can push an example of one class over the boundary into another class by adding the smallest amount of noise possible. Because machine learning is always fuzzy approximation, we can always "push" things over to a different class.

This same stuff applies to LLMs. They are non-ideal, fuzzy function approximation too. Which means they are vulnerable to attack via maliciously crafted inputs.

But we're no longer trying to flip a specific class. Instead we're trying to get a malicious sequence of tokens out of the model, given some input.

> I would distinguish prompt injections as distinct from a basic adversarial example by virtue of having behaviour dictated by state, (autoregressive, rnn or whatever) and the adversarial content induces a state that influences further inferences

Yes and no. This is exactly what my PhD was on: adversarial examples for LSTM based Speech to Text models.

LSTM models have internal state. Classifications are made for each window of feature extracted audio. The state of the network from predicting previous windows affects inferences for later windows. The aim is to get a malicious sequence of tokens out of the model. Oh, interesting, that's the same as what i said in my last paragraph above regarding LLMs!

Here's an example to show the similarity. Load the start of a speech example with adversarial noise and leave the rest of the example untouched. You get a different (adversarial) transcription without adding any noise over the actual speech data, just inject noise at the start of the example. Maths wise, you're crafting a vector of audio that looks like the below, where x' are specific noise samples in a wav file etc.

    X' = [x'_0, x'_1, x'_2, ..., x'_n, x_0, x_1, x_2, ... , x_t]

simpler version

    X' = [adversarial noise, normal speech]

You can do this exact thing with LLMs. The only real differences between "classic advex" and prompt injection is that the data domain (text input) has changed. How would one perform the attack I described above with text based data -- a block of noise + untainted speech?

    > safety prompt text set by model owners
    > ignore all previous instructions
    > malicious prompt text

Oh look, that's direct prompt injection! The example's format is mostly the same, the adversarial "block" is just put after the safety prompt with a specific injection prompt to trick the model

    > defence
    > prompt injection
    > payload

Yes, the mechanism for performing the attack is different. It's not a gradient-based attack trying to flip a series of predictions based a 1-2-1 mapping of input data to output classifications and related state (my PhD). Instead we're feeding in our own sequence of tokens to take advantage of the internal model's representation of language that we think might manipulate it's state in a way we want.

All of this is adversarial examples, but the adversarial threat model is different. And that is true for basically all attacks. Which is why I find the argument that "but prompt injection isn't the same" to be redundant. Most attacks have a subtly tweaked threat model. People use the same argument for LLMs not being the same. They're still approximate functions, nothing has really changed about the fundamentals.

If anything the very fact we can do prompt injection so easily, i.e. without gradient optimisation etc, means these LLM models are even worse than classical advex for robustness.

Prompt injection attacks the models at a higher level than the goodfellow-esque weak attacks, the attack happens in the embedding of language over weights/memory cells/etc. This is SO MUCH WORSE from the perspective of robustness because it's not a few decision boundaries you need to tighten up via regularisation. It's literally the "understanding" of language and intent that is the problem here.

> I am not saying that prompt injection does not exist. I'm saying that I don't think that has been conclusively shown that they cannot be avoided.

To summarise the above:

* all machine learning models are approximate functions, and because they are approximate functions they are vulnerable to adversarial examples

* prompt injection is a form of adversarial example, the data domain is just different

* state can be manipulated, model architecture isn't the way to categorise these attacks (tip: the threat model is)

* LLM prompt injection is a worse problem because it's manipulating the embedded representation of language and intent, we can't just regularise it away

These attacks will always be theoretically possible unless we can map out all possible valid inputs to all possible valid outputs, i.e. unless we can create an ideal function. But then we're not doing machine learning anymore -- we have a heuristic algorithm mapping trusted inputs to trusted outputs.

The AI safety/security researcher question around this is whether we can make the attacks so difficult that they're not worth doing for an adversary. Improving robustness is not fixing the problem, it's making the attacks really hard to do. (i think nicholas carlini brings this up in this talk: https://www.youtube.com/watch?v=-p2il-V-0fk).

Unfortunately these attacks are still incredibly easy to do. So easy in fact that all a researcher had to do was subtly tweak a viral prompt he saw on twitter one day. Maybe one day these companies/researches could get us to AES-512 levels of robustness (takes a ridiculously long time to brute force crack https://bruteforce.bitsnbites.eu).

But I'm doubtful that's going to happen in our lifetime.

----

i haven't even covered Maximum Confidence attacks, which are different to Goodfellow-esque weak attacks. maximum confidence attacks flip the class with the highest confidence possible, while keeping the noise as small as possible. they give us a better idea of how wrong the approximate decision boundary is and how to regularise it.

link