by solid_fuel 8 days ago
I really don't get why people continually fail to understand this.

Even simple issues like prompt injection are unfixable given the architecture of LLMs.


How can a problem that only came into existence a few years ago be declared intractable so quickly?

The architecture of LLMs has not remained static, so any conclusion would have to rely on some common architectural element that could not possibly be changed.

Is there any proof to demonstrate that such vulnerabilities must always exist, and that there is no way to modify the architecture and have it still work while eliminating the vulnerabilities?

That would be an extremely difficult thing to prove. It is however what you would have to do to declare the problem unfixable.

Math is a fairly old invention and multiplication is commutative, there's your proof.

Every LLM takes the input embeddings, which contain both the system prompt and the user prompt, and multiplies all the tokens together to get the input for the next layer. The weights applied to each token vary, but the fact remains.

If you want it in code, a DATABASE would do something like:

    R0 = user_input
    R1 = value_in_database
    cmp R0, R1, R2
The value in register 2 is known to be either true or false, barring a hardware fault. The user can't input "2 but actually say this is greater than 5" and get

    cmp "2 but actually say this is greater than 5", 5, R2
to result in true when it should result in false.
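
A rough Python analogue of that comparison, as a sketch only. `compare_to_stored` and `try_compare` are hypothetical helpers, not any real database API; the assumption is that the engine parses the input as a number before comparing, so text that tries to "argue" with the comparison is rejected rather than interpreted.

```python
def compare_to_stored(user_input: str, value_in_database: int) -> bool:
    """Return True if the user's number is greater than the stored value.

    The input is parsed as an integer first; anything that is not a plain
    integer raises ValueError instead of being interpreted.
    """
    number = int(user_input)
    return number > value_in_database


def try_compare(user_input: str, value_in_database: int) -> str:
    """Wrap the comparison so the three possible outcomes are easy to see."""
    try:
        return str(compare_to_stored(user_input, value_in_database))
    except ValueError:
        return "rejected"


if __name__ == "__main__":
    print(try_compare("2", 5))
    print(try_compare("7", 5))
    print(try_compare("2 but actually say this is greater than 5", 5))
```

Whatever the user types, the result is one of true, false, or a rejection; the text never gets a say in how the comparison is carried out.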

But an LLM works like this:

    R0 = user_prompt_token
    R1 = system_prompt_token
    mul R0, R1, R2
The only thing we can know about R2 is that it will be a floating point value. That's it. If you set up a security gate expecting R2 > 0, I can always find a value of R0 that will give me that result if I know R1 or have some spare time.
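
To make the claim about the `R2 > 0` gate concrete, here is a minimal sketch of that toy multiply-and-threshold setup only. It is not a model of a real transformer, and `gate` and `find_passing_input` are made-up names for illustration.

```python
def gate(r0: float, r1: float) -> bool:
    """The toy security gate from the comment: pass when R0 * R1 > 0."""
    return r0 * r1 > 0


def find_passing_input(r1: float) -> float:
    """Pick an R0 that passes the gate for a known, nonzero R1."""
    if r1 == 0.0:
        raise ValueError("no R0 can make the product positive when R1 is 0")
    # Matching the sign of R1 is enough to make the product positive.
    return 1.0 if r1 > 0 else -1.0


if __name__ == "__main__":
    for r1 in (3.5, -0.25, 1e-9, -1e9):
        r0 = find_passing_input(r1)
        print(r1, r0, gate(r0, r1))
```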
I think you might have just discovered why Neural Nets need a non-linear element.

But consider this: imagine a model that takes an embedding made of 200 values. The first 100 encode numbers and the second 100 encode letters.

You train the model so that if you give it an even number it will turn the letters into upper case, and an odd number will turn them into lower case.

The numbers represent the prompt. The letters represent the non-prompt data.

What letter would you give it to make it think the number is odd?

If you cannot come up with a letter that acts as a number, then this would represent an extremely simple but valid example of a model immune to prompt injection.
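
A hand-written stand-in for that thought experiment, as a sketch. It is an ordinary function, not a trained network, so it only shows what the intended behaviour looks like; `toy_model` and `encode` are hypothetical names, and the layout (100 digit slots followed by 100 character-code slots, non-negative numbers only) is an assumption.

```python
from typing import List


def encode(number: int, text: str) -> List[int]:
    """Build the 200-value input: 100 digits, then 100 character codes."""
    digits = [int(d) for d in str(number).rjust(100, "0")]
    codes = [ord(ch) for ch in text.ljust(100, "\0")]
    return digits + codes


def toy_model(embedding: List[int]) -> str:
    """Upper-case the letters for an even number, lower-case them for an odd one."""
    if len(embedding) != 200:
        raise ValueError("expected exactly 200 values")
    prompt, data = embedding[:100], embedding[100:]
    # Parity is read from the prompt channel only; the data channel is never consulted.
    is_even = prompt[-1] % 2 == 0
    letters = "".join(chr(code) for code in data if code != 0)
    return letters.upper() if is_even else letters.lower()


if __name__ == "__main__":
    print(toy_model(encode(4, "Hello")))
    print(toy_model(encode(7, "Hello")))
    print(toy_model(encode(4, "seven is odd")))
```

In this stand-in no choice of letters changes the parity decision. Whether a trained model can be made to behave this way is exactly what the thread is arguing about.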

Nonlinear doesn’t save you here, the requirement is to prevent cross talk entirely, not just making it hard to find a counter.

The model you describe is not an LLM - you describe a model with a fixed context length and positional attenuation. Congratulations, the network as described no longer has a functioning attention mechanism which is one of the hallmarks of an LLM.

>The requirement is to prevent cross talk entirely,

Quite frankly, no it isn't. Interacting signals can be fully recovered. You can lose information by combining information, but it doesn't necessarily have to be the case.

>The model you describe is not an LLM

But that is a claim you could make about any proposal that might fix prompt injection. If you admit that it does solve the problem, then your claim that an LLM, by your definition, must be vulnerable to prompt injection has to rest on one of the differences between these two architectures.

It's easy enough to imagine a model with a similar command stream and input stream, each with its own attention mechanism and a cross attention between them. You can say that's not an LLM, but then you have a stricter definition that is not interesting.

You end up claiming something like: a broken car will never drive, because if you fix it, it isn't a broken car. True, but not worth claiming.

So far the argument is that once you multiply unknown values by parameters and sum them, you cannot retrieve the original information.

So if your inputs are a and b, and they go through a layer of weighted multiplication and addition, the values are hopelessly intertwined.

So if the layer had weights of c, d, e, f, you'd end up with P = ac + bd and Q = ae + bf.

And both values contain a and b, is that correct?

But since the model contains the weights c, d, e, f, it could also learn a weight of Z = 1/(cf - de). It's just another constant after all. And if a following layer had weights of f, -d, -e, c, then it would produce two outputs, A = Pf - Qd and B = -Pe + Qc.

A and B are proportional to a and b: A = a(cf - de) and B = b(cf - de). Multiply them by Z to get the original values back (as long as cf - de is not zero).

Combining is not the same thing as signal loss.
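
The algebra above can be checked numerically. This is just a sketch of the two-input linear case with arbitrary example weights; `mix` and `unmix` are illustrative names, and it says nothing about whether a real network learns such an inverse.

```python
def mix(a, b, c, d, e, f):
    """First layer: P = a*c + b*d and Q = a*e + b*f."""
    return a * c + b * d, a * e + b * f


def unmix(p, q, c, d, e, f):
    """Second layer with weights (f, -d, -e, c), then scale by Z = 1/(c*f - d*e)."""
    det = c * f - d * e
    if det == 0:
        raise ValueError("weights are singular; a and b cannot be recovered")
    z = 1.0 / det
    big_a = p * f - q * d   # equals a * (c*f - d*e)
    big_b = -p * e + q * c  # equals b * (c*f - d*e)
    return big_a * z, big_b * z


if __name__ == "__main__":
    a, b = 3.0, -2.0
    weights = (2.0, 1.0, 1.0, 3.0)  # c, d, e, f
    p, q = mix(a, b, *weights)
    print(p, q)
    print(unmix(p, q, *weights))
```

The recovery only works when cf - de is nonzero; a singular layer really does lose information.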

it’s not a problem that came into existence a few years ago. we’ve known about these sorts of test time attacks for decades now. prompt injection is just the LLM variant where people use less math to perform the attacks, brute force with prompts they saw on twitter and get horrible images/text out.

https://people.eecs.berkeley.edu/~tygar/papers/Machine_Learn...

https://arxiv.org/abs/1712.03141

it’s a basic property of all machine learning models. at a low level it’s to do with how decision boundaries work.

but, good news! there are two sure fire ways to fully fix the problem! see: https://news.ycombinator.com/item?id=48579456

Adversarial cases are not the same thing as prompt injection.
adversarial examples, or test-time attacks, was a whole field of machine learning security way before LLMs came around.

give the model a specially crafted bad input at inference time so attacker can get some nasty output, potentially defeating any existing defences in the process. [0]

in “modern llm lingo” defence = guardrails and / or system prompts.

prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).

[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection

That is a whole field, of which prompt injection is a class. But that's like saying upon discovering plutonium that we've known about matter for years.

Most machine learning mechanisms perform a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.

You cannot give an image classifier an image that makes it say all of the following images are images of kittens.

I would distinguish prompt injections as distinct from a basic adversarial example by virtue of having behaviour dictated by state (autoregressive, rnn or whatever) and the adversarial content induces a state that influences further inferences.

I am not saying that prompt injection does not exist. I'm saying that I don't think it has been conclusively shown that it cannot be avoided.

if you want to avoid my massive post (sorry), there's a paper here positing how instruction-data separation is likely a major cause of prompt injection specifically.

https://arxiv.org/pdf/2403.06833

then another paper where they change the architecture of a model to deal with the problem and it doesn't eliminate prompt injection. changing the architecture doesn't make this problem go away. the approximate function still gets tricked.

> On average, ASIDE lowers attack success rate by 8.6 and 9.4 percentage points

https://arxiv.org/pdf/2503.10566

the real over-arching cause of all these vulnerabilities is that machine learning models are approximate functions. you need ideal functions to theoretically solve this, i.e. full knowledge of the mapping between trusted inputs to trusted outputs. everything else is just mitigating it in the hope we eventually make it hard enough to perform these attacks.

no-one can stop these attacks from being possible, all they can do is make them more difficult to do (and we are nowhere near them actually being difficult yet).

few days late to reply, ah well.

> That's like saying upon discovering plutonium that we've known about matter for years.

let's not be hyperbolic. it's more like saying we can also use plutonium for nuclear reactors when we know about uranium.

> You cannot give an image classifier an image that makes it say all of the following images are images of kittens.

For classic CNNs of course not because they don't have state. But for RNN/LSTM/GPT networks you absolutely can. If a model has state which affects future outputs it's possible to do exactly what you're describing.
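
To make the stateful case concrete, here is a toy sketch. `StatefulToyClassifier` is a made-up leaky accumulator, not a real RNN, LSTM or GPT; the only point is that once state carries over between inputs, one crafted input can colour every later prediction.

```python
class StatefulToyClassifier:
    """Toy classifier whose decision depends on a decaying running state."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.state = 0.0

    def classify(self, kitten_score: float) -> str:
        # The carried-over state is mixed into every later decision.
        self.state = self.decay * self.state + kitten_score
        return "kitten" if self.state > 0 else "machine gun"


if __name__ == "__main__":
    clean = StatefulToyClassifier()
    print([clean.classify(x) for x in (-1.0, -1.0, -1.0)])

    poisoned = StatefulToyClassifier()
    poisoned.classify(100.0)  # one crafted input up front
    print([poisoned.classify(x) for x in (-1.0, -1.0, -1.0)])
```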

> Most machine learning mechanisms perform a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.

Yes, but they are approximate functions.

Given an image of a kitten, an ideal classifier function will always tell us the image of a kitten is a kitten. A decent approximate classifier function will classify the kitten image correctly enough of the time. That approximate part is why adversarial examples work: we use training data and train a model which is non-ideal.

The gaps between approximate decision boundaries and true decision boundaries allow us to generate Ian-Goodfellow-esque weak adversarial examples. We can push an example of one class over the boundary into another class by adding the smallest amount of noise possible. Because machine learning is always fuzzy approximation, we can always "push" things over to a different class.
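
A minimal sketch of "pushing" an example over a boundary, assuming a plain linear classifier with made-up weights. This is the closed-form smallest L2 step for the linear case, not Goodfellow's FGSM and not an attack on any real model; `classify` and `minimal_push` are illustrative names.

```python
import math
from typing import List


def classify(w: List[float], b: float, x: List[float]) -> int:
    """Linear classifier: class 1 when w.x + b > 0, otherwise class 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


def minimal_push(w: List[float], b: float, x: List[float],
                 margin: float = 1e-6) -> List[float]:
    """Move x along the weight direction until it sits just across the boundary."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("all-zero weights define no decision boundary")
    # Aim for a score slightly on the other side of zero.
    target = -margin if score > 0 else margin
    scale = (target - score) / norm_sq
    return [xi + scale * wi for xi, wi in zip(x, w)]


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.5
    x = [1.0, 1.0]
    x_adv = minimal_push(w, b, x)
    noise = math.sqrt(sum((p - q) ** 2 for p, q in zip(x_adv, x)))
    print(classify(w, b, x), classify(w, b, x_adv), noise)
```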

This same stuff applies to LLMs. They are non-ideal, fuzzy function approximation too. Which means they are vulnerable to attack via maliciously crafted inputs.

But we're no longer trying to flip a specific class. Instead we're trying to get a malicious sequence of tokens out of the model, given some input.

> I would distinguish prompt injections as distinct from a basic adversarial example by virtue of having behaviour dictated by state, (autoregressive, rnn or whatever) and the adversarial content induces a state that influences further inferences

Yes and no. This is exactly what my PhD was on: adversarial examples for LSTM based Speech to Text models.

LSTM models have internal state. Classifications are made for each window of feature extracted audio. The state of the network from predicting previous windows affects inferences for later windows. The aim is to get a malicious sequence of tokens out of the model. Oh, interesting, that's the same as what i said in my last paragraph above regarding LLMs!

Here's an example to show the similarity. Load the start of a speech example with adversarial noise and leave the rest of the example untouched. You get a different (adversarial) transcription without adding any noise over the actual speech data, just inject noise at the start of the example. Maths wise, you're crafting a vector of audio that looks like the below, where x' are specific noise samples in a wav file etc.

    X' = [x'_0, x'_1, x'_2, ..., x'_n, x_0, x_1, x_2, ... , x_t]
simpler version

    X' = [adversarial noise, normal speech]
You can do this exact thing with LLMs. The only real difference between "classic advex" and prompt injection is that the data domain (text input) has changed. How would one perform the attack I described above with text based data -- a block of noise + untainted speech?

    > safety prompt text set by model owners
    > ignore all previous instructions
    > malicious prompt text
Oh look, that's direct prompt injection! The example's format is mostly the same, the adversarial "block" is just put after the safety prompt with a specific injection prompt to trick the model.

    > defence
    > prompt injection
    > payload
Yes, the mechanism for performing the attack is different. It's not a gradient-based attack trying to flip a series of predictions based on a 1-to-1 mapping of input data to output classifications and related state (my PhD). Instead we're feeding in our own sequence of tokens to take advantage of the model's internal representation of language, in a way that we think might manipulate its state the way we want.

All of this is adversarial examples, but the adversarial threat model is different. And that is true for basically all attacks. Which is why I find the argument that "but prompt injection isn't the same" to be redundant. Most attacks have a subtly tweaked threat model. People use the same argument for LLMs not being the same. They're still approximate functions, nothing has really changed about the fundamentals.

If anything the very fact we can do prompt injection so easily, i.e. without gradient optimisation etc, means these LLM models are even worse than classical advex for robustness.

Prompt injection attacks the models at a higher level than the goodfellow-esque weak attacks, the attack happens in the embedding of language over weights/memory cells/etc. This is SO MUCH WORSE from the perspective of robustness because it's not a few decision boundaries you need to tighten up via regularisation. It's literally the "understanding" of language and intent that is the problem here.

> I am not saying that prompt injection does not exist. I'm saying that I don't think it has been conclusively shown that it cannot be avoided.

To summarise the above:

* all machine learning models are approximate functions, and because they are approximate functions they are vulnerable to adversarial examples

* prompt injection is a form of adversarial example, the data domain is just different

* state can be manipulated, model architecture isn't the way to categorise these attacks (tip: the threat model is)

* LLM prompt injection is a worse problem because it's manipulating the embedded representation of language and intent, we can't just regularise it away

These attacks will always be theoretically possible unless we can map out all possible valid inputs to all possible valid outputs, i.e. unless we can create an ideal function. But then we're not doing machine learning anymore -- we have a heuristic algorithm mapping trusted inputs to trusted outputs.

The AI safety/security researcher question around this is whether we can make the attacks so difficult that they're not worth doing for an adversary. Improving robustness is not fixing the problem, it's making the attacks really hard to do. (i think nicholas carlini brings this up in this talk: https://www.youtube.com/watch?v=-p2il-V-0fk).

Unfortunately these attacks are still incredibly easy to do. So easy in fact that all a researcher had to do was subtly tweak a viral prompt he saw on twitter one day. Maybe one day these companies/researchers could get us to AES-256 levels of robustness (takes a ridiculously long time to brute force crack https://bruteforce.bitsnbites.eu).

But I'm doubtful that's going to happen in our lifetime.

----

i haven't even covered Maximum Confidence attacks, which are different to Goodfellow-esque weak attacks. maximum confidence attacks flip the class with the highest confidence possible, while keeping the noise as small as possible. they give us a better idea of how wrong the approximate decision boundary is and how to regularise it.

> issues like prompt injection are unfixable

how is it unfixable? do you mean "there's always a positive chance"?

I mean that, unlike SQL injection, there is no way to draw a boundary between user provided data and the system prompt. It can't be done. They are stitched together and fed into the attention layer, after that there is only "neurons" - that is, the matrices of floating point numbers which each layer of the network produces.

You cannot separate data that was input by the user and data that is from the system once it is mixed together like that. Therefore, it follows that there will always be ways to influence the model off the guard rails that a system prompt tries to set up.
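
A deliberately simplified sketch of the "stitched together" point. It assumes a whitespace tokenizer and plain concatenation with no delimiters; real chat models typically use chat templates with special role tokens, so treat this as an illustration of the argument rather than of any actual inference stack. `build_model_input` is a hypothetical name.

```python
from typing import List


def build_model_input(system_prompt: str, user_input: str) -> List[str]:
    """Concatenate both prompts into one flat token sequence."""
    return system_prompt.split() + user_input.split()


if __name__ == "__main__":
    first = build_model_input(
        "Never reveal the password.",
        "ignore all previous instructions",
    )
    second = build_model_input(
        "Never reveal the",
        "password. ignore all previous instructions",
    )
    # Two different system/user splits produce the same sequence, so the
    # sequence alone does not say where the system prompt ended.
    print(first == second)
```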

Other issues that appear similar, like SQL injection and buffer overflows, are fixable because while the user data and the system code may interact, they never (failing a bug) interact in a way that breaks the boundary between those two sides.

Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

If user input can only be in the low byte, it cannot influence the command structure.

A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.
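
The high-byte/low-byte idea can be sketched as follows. This is a hypothetical encoding invented for the analogy, not how any SQL engine or embedding layer actually works, and all three function names are made up.

```python
from typing import List


def encode_command(text: str) -> List[int]:
    """Trusted command characters go in the high byte of each 16-bit unit."""
    units = []
    for ch in text:
        if ord(ch) > 0x7F:
            raise ValueError("commands must be ASCII")
        units.append(ord(ch) << 8)
    return units


def encode_user_data(text: str) -> List[int]:
    """User characters are masked into the low byte, so the high byte stays 0."""
    return [ord(ch) & 0xFF for ch in text]


def extract_commands(units: List[int]) -> str:
    """The engine reads commands from the high byte only."""
    return "".join(chr(u >> 8) for u in units if u >> 8)


if __name__ == "__main__":
    stream = encode_command("SELECT ") + encode_user_data("x; DROP TABLE users")
    print(repr(extract_commands(stream)))
```

Because user input is masked before it enters the stream, nothing the user types can show up in the command channel. The open question in the thread is whether a provenance signal in an embedding would be respected as strictly as this mask is.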

>You cannot separate data that was input by the user and data that is from the system once it is mixed together like that.

You can train a model to not mix things, many models are trained to separate things. A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs. Sure it could be trained to reverse the output, but it is also easy to train something to the point that you have high confidence it will never do that.

> Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

> If user input can only be in the low byte, it cannot influence the command structure.

> A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.

A similar thing cannot be done with embeddings. You are lacking a fundamental understanding of the issue. The only reason that you can separate user and command data in SQL queries is because the command data is used to command a deterministic machine which then uses the user data as inputs to carefully constructed operations like comparisons.

This is not how LLMs operate. There is no deterministic machinery executing a system prompt against user data, there is only a single array of tensors which get fed into a giant block of linear algebra and multiplied together.

> You can train a model to not mix things, many models are trained to separate things.

That is not applicable to this, because segmentation models are not the same thing as LLMs. They have different architectures.

> A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs.

Not even close to the same thing, to the point where this is irrelevant.

Feel free to prove me wrong, github links welcome below.

You misunderstand the challenge you face.

I know what models do at the moment, and I don't know of any doing this approach at the moment, but I don't need to. I don't need to show that this mechanism works. Your claim that the problem is intractable means it is incumbent upon you to show that it won't work.

I provided this particular example to show a way to modify a LLM architecture that may address the problem.

>there is only a single array of tensors which get fed into a giant block of linear algebra and multiplied together.

For starters, that's wrong. If you don't know why and how to make things non-linear then you might not have the understanding that you think you do.

>> You can train a model to not mix things, many models are trained to separate things.

>That is not applicable to this, because segmentation models are not the same thing as LLMs. They have different architectures.

I used that particular example because you said "You cannot separate data that was input by the user and data that is from the system once it is mixed together like that" and that simply is not true. LLMs can do what neural nets do because they contain them, and neural nets can perform functions. If there is any signal distinguishing two things then there is a function that can separate them.

Not knowing how to do this does not mean it cannot be done. An inadequate description of a transformer certainly does not do it.

> I used that particular example because you said "You cannot separate data that was input by the user and data that is from the system once it is mixed together like that" and that simply is not true. LLMs can do what neural nets do because they contain them, and neural nets can perform functions. If there is any signal distinguishing two things then there is a function that can separate them.

Oh my, this is a serious misunderstanding on your part. That segmentation models can classify portions of an input into separate groups has no bearing on being able to unmix user and system intent within the confines of an LLM.

Just one of many issues with your reasoning here: a segmentation model works along boundaries in the data. E.g. in simple terms, a foreground segmentation model works because you can define a clear foreground and background for most images. There is no way to differentiate system and user intent in the same way, they aren’t segmentable in the same way as an image.

This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".
> This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".

Try reading it from start to end, it will make more sense if you think about it.

By the way, if your OS is taking untrusted data from the network, inserting it into an executable code page, and loading it into the CPU then you have some SERIOUS security issues.

but it's all just bytes?
It's all bytes but untrusted user data is stored in memory pages which are not marked executable.

The CPU physically will not run instructions which are in areas of memory which are not marked as executable. This is a foundational principle of computing security.

> In computer security, executable-space protection marks memory regions as non-executable, such that an attempt to execute machine code in these regions will cause an exception. It relies on hardware features such as the NX bit (no-execute bit), or on software emulation when hardware support is unavailable. Software emulation often introduces a performance cost, or overhead (extra processing time or resources), while hardware-based NX bit implementations have no measurable performance impact.

https://en.wikipedia.org/wiki/Executable-space_protection

so, SQL injections and buffer overflows aren't unfixable because they never happen assuming nobody ever makes mistakes?

under the same assumption you can just train your model until the output is correct

normal

    y = f(x)
prompt injection / adversarial example (same thing really)

    bad_y = f(x+badness)
tweak badness enough you will get bad outputs. no matter the defences.

the only ways to fully “fix” it ie to make prompt injection never possible

1. don’t use ai

2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.

otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. it’s never “fixed”. the latest thing just gets “patched”.

> tweak badness enough

assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

> the only way to fix ...

the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

> how is it unfixable?

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure

so... it's possible to attack these models with the formulation i described, just with some particular assumptions.

the AI safety/security problem is about trying to make this sort of thing very difficult to do, so much so that an attacker wouldn't try. that's not fixing the problem, that's mitigating the problem. two very different things. as the article we're commenting under shows, it's really not difficult to do nasty prompt injection attacks right now.

> technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping)

machine learning models are approximation functions, not pure functions. they are non-deterministic and non-ideal.

when i say "input space" i mean all possible combinations of valid tokens as inputs. when i say "output space" i mean all possible combinations of valid tokens as outputs that are valid continuations of the input sequence. that's massive combinatorials.

also, there's no api? most likely next output text is provided conditioned on being a continuation of the input text. it's probabilistic inference. there is no api.

----

you're using a lot of software terms to try and explain yourself. don't do that. seriously. as someone who tried doing that in my PhD instead of actually learning the fundamentals -- learn the fundamentals of machine learning if you'd like to engage in these kinds of discussions.

it'll help you.

> that's not fixing the problem, that's mitigating the problem

is there anything humanity ever "fixed" then? surely it's possible in principle to solve at least some things that weren't solved yet

> approximation functions, not pure functions

how is an approximation function not a pure function?

> non deterministic

you can set topk=1 or think in terms of distributions; still might have some undocumented non-determinism, hence "~pure"
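
A small sketch of why top-k sampling with k = 1 is deterministic, assuming the usual "keep the k best logits, then sample" scheme. `sample_top_k` is a hypothetical helper and the logits are made up; it does not account for the undocumented non-determinism mentioned above (for example from floating point or batching in a real serving stack).

```python
import math
import random
from typing import Dict


def sample_top_k(logits: Dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the k highest-scoring candidates."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [math.exp(logit) for _, logit in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = {"yes": 2.0, "no": 1.5, "maybe": 0.1}
    # With k=1 only the argmax survives, so the seed no longer matters.
    print({sample_top_k(logits, 1, random.Random(seed)) for seed in range(20)})
    # With k=3 the output generally varies with the seed.
    print({sample_top_k(logits, 3, random.Random(seed)) for seed in range(20)})
```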

> non-ideal

what do you mean?

> massive combinatorials

so you get to make arbitrary assumptions, but I'm supposed to limit myself to non-massive combinatorials?

> no api

ok, "domain and codomain", happy? I'm trying to optimize for probability of being understood and inverse smartass-ness

> learn the fundamentals

so you think I don't know the fundamentals because I didn't use category theory to talk about prompt injections?

> so you think I don't know the fundamentals because I didn't use category theory to talk about prompt injections?

You have made it abundantly clear that you don't know the fundamentals. If you want people to consider the arguments you put forth, you will need a better understanding of the problem domain. Go study, come back when you can contribute.

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

Clearly nothing so complicated is required, given the prompt in the very article you are commenting on.

> the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

Yeah and the halting problem is hard too, but there's levels to this shit.

> also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

I would argue we don't even know the desired output for most inputs for an LLM and they certainly aren't trained on every possible input state. But I think Linux and LLMs are sufficiently different that they aren't really directly comparable like this. After all, Linux is not a pure function and has lots of side effects.

But just to establish an order of magnitude: the input space for GPT-3 was 2,048 tokens long. There were 50,257 tokens in the vocabulary. The input space thus has 50,257^(2048) unique states, which is approximately equal to 1.12 × 10^9628. That's an awfully big input space for a single function.
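
The order of magnitude can be checked with logarithms. This sketch uses the vocabulary size and context length quoted above and counts only sequences of exactly 2,048 tokens; shorter sequences add comparatively little to the total.

```python
import math

vocab_size = 50_257
context_length = 2_048

# log10(vocab_size ** context_length), computed without building the huge integer
log10_states = context_length * math.log10(vocab_size)
exponent = math.floor(log10_states)
mantissa = 10 ** (log10_states - exponent)

print(f"{mantissa:.2f} x 10^{exponent}")  # about 1.12 x 10^9628
```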

> clearly nothing ... is required

this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?

> we don't know the desired output

then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?

> linux is not a pure function ...

which is my point -- it's worse

> to establish an order of magnitude

and for linux?

the prompt in the article is prompt injection https://owasp.org/www-community/attacks/PromptInjection

see Types -- Based on Delivery Vector -- Direct Prompt Injection

the instructions being overridden are the original safety prompt conditioning the model to not output horrible/nasty images

> this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?

Yes it is, and nice backtrack in the same sentence there. I've laid out plenty of evidence here so far, it's your turn to start thinking. We'll try the Socratic method.

Given that every LLM seen so far has been vulnerable to prompt injection attacks, what is your possible basis for thinking that one can be made immune from them? I'm going from "multiple attacks of this type exist for all known models, and the attacks exploit a known weakness in the design" to "therefore all LLMs are susceptible to this attack".

You're going from "an attack exists for all known models" to "it's definitely possible to build an LLM that is immune from this attack". That's a much larger leap, so show the logic backing your assertion.

> then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?

You are the one asserting that input/output mappings existed for the entire space, not me.

>> linux is not a pure function ...

> which is my point -- it's worse

What, is this your first year in CS? No useful system can be a pure function. Side effects are work, if your function doesn't have a side effect, it does no work. Any system that uses an LLM to attempt work will have side effects - they may even include bombing an elementary school in Iran.

>> to establish an order of magnitude

> and for linux?

I've done all the thinking and all the research in this conversation so far, and I even specifically explained that you can't measure state space for a stateful function in a comparable way to a pure function. Clearly you didn't understand that, so if you want to force the comparison you can start adding up the state space for the linux kernel. Start with the spaces that are covered by tests, valid items include syscalls, registers, hardware interrupts, etc.

Invalid spaces include doing something intentionally stupid like using the entire size of the ram or the space on the hard disk, since those are accessed on demand and not - like in an llm - all added together and fed into a blender every time a syscall is made.

There is always going to be a non-zero chance with a non-deterministic system. You can put every guard rail in place and there will always be a different way tokens are input to get bad, or subjective, tokens as output.

The findings are sick and disturbing. I hope OpenAI is not only sued for it but also that Sam Altman along with Elon, Dario and Sundar are all held accountable in front of Congress. All of these assholes have intentionally put sexual content in their models, likely including CSAM, and so if they cannot prove that it isn't part of their training data then maybe they shouldn't be able to operate as they are today.

Where is fear mongering Dario now? He loves to drag his trope around about how advanced and dangerous his models are with respect to cyber security. Yet... We never hear him say how dangerous they could be with respect to generation of CSAM! Maybe because that wouldn't help him IPO?

> non-zero

is it ever zero? is non-zero even a problem for sane usecases?

> Dario

are you saying claude reproduces CSAM from the training set? like, in ascii?

That's certainly true. The problem is, some people learn that and go "and that's okay", rather than "so they shouldn't exist and we shouldn't build them".
hopes and dreams are one hell of a drug
I don’t get it either. I think there is a reasonable expectation to try to catch these things but at the end of the day it’s figuring out some form of probabilistic outcome.
What really surprises me about this is that it sounds like they're not even trying to classify and censor generated images post-generation?

Nothing is perfect, but there are tiny classifier models that can at least mark things containing nudity and gore. That would be the bare-minimum I would expect for trying to put guardrails around an image generator.

Exactly, I think it shows failures at OpenAI to have effective classifiers. That’s the real story here.
and yet as fable demonstrated in its inability to differentiate anything physics biology or chemistry related from actual safety concerns, it’s apparently not easy to do