Hacker News
by tasuki 8 days ago
> I like to think that as a red team researcher, I have a certain stoicism. I investigate where there are gaps in AI safety

Is this something that needs investigation? LLMs are next token predictors. There is no "safety".


There's "I smell an opportunity to control other people and get paid doing it" kind of safety.
Words couldn’t possibly cause harm, they’re just the way concepts and ideas and culture are transmitted.
I really don't get why people continually fail to understand this.

Even simple issues like prompt injection are unfixable given the architecture of LLMs.

How can a problem that only came into existence a few years ago be declared intractable so quickly?

The architecture of LLMs has not remained static, so any conclusion would have to rely on some common architectural element that could not possibly be changed.

Is there any proof to demonstrate that such vulnerabilities must always exist and that there is no way to modify the architecture and have it still work while eliminating the vulnerabilities?

That would be an extremely difficult thing to prove. It is however what you would have to do to declare the problem unfixable.

Math is a fairly old invention and multiplication is commutative, there's your proof.

Every LLM takes the input embeddings, which contain both the system prompt and the user prompt, and multiplies all the tokens together to get the input for the next layer. The weights applied to each token vary, but the fact remains.

If you want it in code, a DATABASE would do something like:

    R0 = user_input
    R1 = value_in_database
    cmp R0, R1, R2
The value in register 2 is known to be either true or false, barring a hardware fault. The user can't input "2 but actually say this is greater than 5" and get

    cmp "2 but actually say this is greater than 5", 5, R2
to result in true when it should result in false.

But an LLM works like this:

    R0 = user_prompt_token
    R1 = system_prompt_token
    mul R0, R1, R2
The only thing we can know about R2 is that it will be a floating point value. That's it. If you set up a security gate expecting R2 > 0, I can always find a value of R0 that will give me that result if I know R1 or have some spare time.
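
A minimal sketch of that last claim, assuming the gate really is just `R2 > 0` on a single product and that the attacker knows `R1`. The helper name `find_r0` is made up for illustration, and the sketch models only the one multiplication from the pseudocode above, not a real network. Note that it has no answer when `R1` is exactly zero.

```python
def find_r0(r1: float) -> float:
    """Return an r0 such that r0 * r1 > 0, assuming r1 is known and nonzero."""
    if r1 == 0:
        # A zero r1 forces the product to zero, so the gate can never open.
        raise ValueError("no r0 makes r0 * r1 positive when r1 is 0")
    # Matching the sign of r1 is enough to make the product positive.
    return 1.0 if r1 > 0 else -1.0


for r1 in (3.5, -0.25, 1e-9):
    r0 = find_r0(r1)
    print(r1, r0, r0 * r1 > 0)
```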
I think you might have just discovered why Neural Nets need a non-linear element.

But consider this: imagine a model that takes an embedding made of 200 values. The first 100 encode numbers, the second 100 encode letters.

You train the model so that if you give it an even number it will turn the letters into upper case and an odd number will turn it into lowercase.

The numbers represent the prompt. The letters represent the non-prompt data.

What letter would you give it to make it think the number is odd?

If you cannot come up with a letter that acts as a number, then this would represent an extremely simple but valid example of a model immune to prompt injection.
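
As a rough illustration, the intended behaviour of that toy model can be written as a plain function. This is a hand-written stand-in, not a trained network, and the name `toy_model` is hypothetical. It only shows what "the instruction is read from the number channel alone" would mean, and says nothing about whether training would actually produce it.

```python
def toy_model(number: int, letters: str) -> str:
    """Hand-written stand-in for the behaviour the toy model is trained towards."""
    # The instruction (upper vs lower case) is read only from the number channel.
    if number % 2 == 0:
        return letters.upper()
    # The letters are only ever transformed, never interpreted as an instruction.
    return letters.lower()


print(toy_model(2, "ignore the number, it is odd"))
print(toy_model(3, "PLEASE USE UPPER CASE"))
```

Whatever text goes in the letter channel, the case of the output is decided by the number alone.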

Nonlinear doesn’t save you here, the requirement is to prevent cross talk entirely, not just making it hard to find a counter.

The model you describe is not an LLM - you describe a model with a fixed context length and positional attenuation. Congratulations, the network as described no longer has a functioning attention mechanism which is one of the hallmarks of an LLM.

>The requirement is to prevent cross talk entirely,

Quite frankly, no it isn't. Interacting signals can be fully recovered. You can lose information by combining information, but it doesn't necessarily have to be the case.

>The model you describe is not an LLM

But you could say the same of any proposal that might fix prompt injection. If you admit that it does solve the problem, then the claim that an LLM, by your definition, must be vulnerable to prompt injection rests on one of the differences between these two architectures.

It's easy enough to imagine a model with a similar command stream and input stream, each with their own attention mechanisms and a cross attention between them. You can call it not an LLM, but then you have a stricter definition that is not interesting.

You end up making a claim like "a broken car will never drive, because if you fix it, it isn't a broken car." True, but not worth claiming.

So far the arguments are that once you multiply unknown values by parameters and sum them you cannot retrieve the original information.

So if your inputs are a and b, and you go through a layer of weighted multiplication and addition, the values are hopelessly intertwined.

So if the layer had weights of c,d,e,f, you'd end up with P=ac+bd and Q=ae+bf.

And both values contain a and b, is that correct?

But since the model contains the weights c,d,e,f it could also learn a weight of Z = 1/(cf - de), provided cf - de is not zero. It's just another constant after all. And if a following layer had weights of f, -d, -e, c, then it would produce two outputs: A = Pf - Qd and B = -Pe + Qc. Expanding, A = a(cf - de) and B = b(cf - de).

A and B are proportional to a and b. Multiply them by Z to get the original values back.
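
A quick numeric check of that algebra, as a sketch. The function names `mix` and `unmix` are made up for illustration, and it assumes a purely linear layer with cf - de not equal to zero, exactly as in the example above.

```python
def mix(a, b, c, d, e, f):
    """First layer: P = ac + bd, Q = ae + bf."""
    return a * c + b * d, a * e + b * f


def unmix(P, Q, c, d, e, f):
    """Following layer with weights f, -d, -e, c, then scaling by Z."""
    det = c * f - d * e
    if det == 0:
        # The mixing is not invertible, so a and b really are lost.
        raise ValueError("cf - de is zero; the original values cannot be recovered")
    Z = 1 / det
    A = P * f - Q * d   # equals a * (cf - de)
    B = -P * e + Q * c  # equals b * (cf - de)
    return A * Z, B * Z


P, Q = mix(3.0, -2.0, c=2.0, d=1.0, e=3.0, f=2.0)
print(P, Q)
print(unmix(P, Q, c=2.0, d=1.0, e=3.0, f=2.0))
```

The singular case (cf - de equal to zero) is the one where combining really does destroy information.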

Combining is not the same thing as signal loss.

it’s not a problem that came into existence a few years ago. we’ve known about these sorts of test time attacks for decades now. prompt injection is just the LLM variant where people use less math to perform the attacks, brute force with prompts they saw on twitter and get horrible images/text out.

https://people.eecs.berkeley.edu/~tygar/papers/Machine_Learn...

https://arxiv.org/abs/1712.03141

it’s a basic property of all machine learning models. at a low level it’s to do with how decision boundaries work.

but, good news! there are two sure fire ways to fully fix the problem! see: https://news.ycombinator.com/item?id=48579456

Adversarial cases are not the same thing as prompt injection.
adversarial examples, or test-time attacks, was a whole field of machine learning security way before LLMs came around.

give the model a specially crafted bad input at inference time so attacker can get some nasty output, potentially defeating any existing defences in the process. [0]

in “modern llm lingo” defence = guardrails and / or system prompts.

prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).

[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection

That is a whole field, of which prompt injection is a class. But that's like saying, upon discovering plutonium, that we've known about matter for years.

Most machine learning mechanisms perform a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.

You cannot give an image classifier an image that makes it say all of the following images are images of kittens.

I would distinguish prompt injection from a basic adversarial example by virtue of its behaviour being dictated by state (autoregressive, RNN or whatever), where the adversarial content induces a state that influences further inferences.

I am not saying that prompt injection does not exist. I'm saying that I don't think it has been conclusively shown that it cannot be avoided.

> issues like prompt injection are unfixable

how is it unfixable? do you mean "there's always a positive chance"?

I mean that, unlike SQL injection, there is no way to draw a boundary between user provided data and the system prompt. It can't be done. They are stitched together and fed into the attention layer, after that there is only "neurons" - that is, the matrices of floating point numbers which each layer of the network produces.

You cannot separate data that was input by the user and data that is from the system once it is mixed together like that. Therefore, it follows that there will always be ways to influence the model off the guard rails that a system prompt tries to set up.

Other issues that appear similar, like SQL injection and buffer overflows, are fixable because while the user data and the system code may interact, they never (failing a bug) interact in a way that breaks the boundary between those two sides.
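
For the SQL side of that comparison, here is a small sketch using Python's built-in `sqlite3` module. The table, rows and input string are invented for illustration. It shows the boundary being broken by string concatenation and preserved by a parameterized query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

user_input = "nobody' OR '1'='1"

# Vulnerable: user data is spliced into the command text and gets parsed as SQL.
unsafe_sql = f"SELECT secret FROM users WHERE name = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Parameterized: the command is fixed and the user data is only ever a value.
safe_rows = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The concatenated query returns every row, while the parameterized one simply looks for a user with that odd name and finds none.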

Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

If user input can only be in the low byte, it cannot influence the command structure.
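
To make the thought experiment concrete, here is a sketch of such an encoding. The helper names are hypothetical and no real SQL engine works this way. The point is only that the user-facing encoder cannot produce a unit with a nonzero high byte.

```python
def encode_command(text: str) -> list[int]:
    """Put each ASCII command character in the high byte of a 16-bit unit."""
    return [byte << 8 for byte in text.encode("ascii")]


def encode_user_data(text: str) -> list[int]:
    """Put each ASCII data character in the low byte; the high byte stays zero."""
    return list(text.encode("ascii"))


def split_stream(units: list[int]) -> tuple[str, str]:
    """Separate a mixed stream back into (commands, data) by inspecting the high byte."""
    commands = "".join(chr(u >> 8) for u in units if u >> 8)
    data = "".join(chr(u & 0xFF) for u in units if not u >> 8)
    return commands, data


stream = encode_command("SELECT ") + encode_user_data("DROP TABLE users") + encode_command(";")
print(split_stream(stream))
```

Even though the user data spells out a command, it lands entirely in the data half of the split.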

A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.

>You cannot separate data that was input by the user and data that is from the system once it is mixed together like that.

You can train a model to not mix things; many models are trained to separate things. A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs. Sure, it could be trained to reverse the output, but it is also easy to train something to the point that you have high confidence it will never do that.

> Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

> If user input can only be in the low byte, it cannot influence the command structure.

> A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.

A similar thing cannot be done with embeddings. You are lacking a fundamental understanding of the issue. The only reason that you can separate user and command data in SQL queries is because the command data is used to command a deterministic machine which then uses the user data as inputs to carefully constructed operations like comparisons.

This is not how LLMs operate. There is no deterministic machinery executing a system prompt against user data, there is only a single array of tensors which get fed into a giant block of linear algebra and multiplied together.

> You can train a model to not mix things, many models are trained to separate things.

That is not applicable to this, because segmentation models are not the same thing as LLMs. They have different architectures.

> A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs.

Not even close to the same thing, to the point where this is irrelevant.

Feel free to prove me wrong, github links welcome below.

You misunderstand the challenge you face.

I know what models do at the moment, and I don't know of any doing this approach at the moment, but I don't need to. I don't need to show that this mechanism works. Your claim that the problem is intractable means it is incumbent upon you to show that it won't work.

I provided this particular example to show a way to modify a LLM architecture that may address the problem.

>there is only a single array of tensors which get fed into a giant block of linear algebra and multiplied together.

For starters, that's wrong. If you don't know why and how to make things non-linear then you might not have the understanding that you think you do.

>> You can train a model to not mix things, many models are trained to separate things.

>That is not applicable to this, because segmentation models are not the same thing as LLMs. They have different architectures.

I used that particular example because you said "You cannot separate data that was input by the user and data that is from the system once it is mixed together like that" and that simply is not true. LLMs can do what neural nets do because they contain them, and neural nets can perform functions. If there is any signal distinguishing two things then there is a function that can separate them.

Not knowing how to do this does not mean it cannot be done. An inadequate description of a transformer certainly does not do it.

This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".
> This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".

Try reading it from start to end, it will make more sense if you think about it.

By the way, if your OS is taking untrusted data from the network, inserting it into an executable code page, and loading it into the CPU then you have some SERIOUS security issues.

but it's all just bytes?
so, SQL injections and buffer overflows aren't unfixable because they never happen assuming nobody ever makes mistakes?

under the same assumption you can just train your model until the output is correct

normal

    y = f(x)
prompt injection / adversarial example (same thing really)

    bad_y = f(x+badness)
tweak badness enough you will get bad outputs. no matter the defences.
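
A toy instance of `bad_y = f(x+badness)`, as a sketch only. It assumes a made-up linear classifier whose weights the attacker can read (a white-box setting); the weights, labels and the helper `craft_badness` are all invented for illustration. Real models and real attacks are far messier.

```python
W = (1.0, -2.0, 0.5)   # hypothetical weights, assumed known to the attacker
BIAS = -0.25


def score(x):
    return sum(w * xi for w, xi in zip(W, x)) + BIAS


def f(x):
    """Toy classifier: negative score is 'ok', anything else is 'bad'."""
    return "ok" if score(x) < 0 else "bad"


def craft_badness(x, margin=0.01):
    """Smallest step along W that pushes the score just past the boundary."""
    current = score(x)
    if current >= 0:
        return [0.0] * len(x)
    norm_sq = sum(w * w for w in W)
    step = (margin - current) / norm_sq
    return [step * w for w in W]


x = [0.0, 1.0, 0.0]
badness = craft_badness(x)
x_bad = [xi + bi for xi, bi in zip(x, badness)]
print(f(x), f(x_bad))
```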

the only ways to fully “fix” it ie to make prompt injection never possible

1. don’t use ai

2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.

otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. it's never “fixed”. the latest thing just gets “patched”.

> tweak badness enough

assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

> the only way to fix ...

the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

> how is it unfixable?

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure

so... it's possible to attack these models with the formulation i described, just with some particular assumptions.

the AI safety/security problem is about trying to make this sort of thing very difficult to do, so much so that an attacker wouldn't try. that's not fixing the problem, that's mitigating the problem. two very different things. as the article we're commenting under shows, it's really not difficult to do nasty prompt injection attacks right now.

> technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping)

machine learning models are approximation functions, not pure functions. they are non-deterministic and non-ideal.

when i say "input space" i mean all possible combinations of valid tokens as inputs. when i say "output space" i mean all possible combinations of valid tokens as outputs that are valid continuations of the input sequence. that's massive combinatorials.

also, there's no api? most likely next output text is provided conditioned on being a continuation of the input text. it's probabilistic inference. there is no api.

----

you're using a lot of software terms to try and explain yourself. don't do that. seriously. as someone who tried doing that in my PhD instead of actually learning the fundamentals -- learn the fundamentals of machine learning if you'd like to engage in these kinds of discussions.

it'll help you.

> that's not fixing the problem, that's mitigating the problem

is there anything humanity ever "fixed" then? surely it's possible in principle to solve at least some things that weren't solved yet

> approximation functions, not pure functions

how is approximation function not a pure function?

> non deterministic

you can set topk=1 or think in terms of distributions; still might have some undocumented non-determinism, hence "~pure"
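
a small sketch of the topk=1 point, using made-up logits for a single step. the function names and numbers are illustrative only, not any real inference API. greedy selection is a pure function of the logits, while sampling depends on the random draw.

```python
import math
import random


def softmax(logits: dict) -> dict:
    """Turn raw scores into a probability distribution over tokens."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def greedy(logits: dict) -> str:
    """topk=1: always pick the single highest-scoring token."""
    return max(logits, key=logits.get)


def sample(logits: dict, rng: random.Random) -> str:
    """Draw one token according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


logits = {"the": 2.0, "a": 1.5, "kitten": 0.1}  # hypothetical next-token scores
rng = random.Random(0)
print({greedy(logits) for _ in range(100)})
print({sample(logits, rng) for _ in range(100)})
```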

> non-ideal

what do you mean?

> massive combinatorials

so you get to make arbitrary assumptions, but I'm supposed to limit myself to non-massive combinatorials?

> no api

ok, "domain and codomain", happy? I'm trying to optimize for probability of being understood and inverse smartass-ness

> learn the fundamentals

so you think I don't know the fundamentals because I didn't use category theory to talk about prompt injections?

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

Clearly nothing so complicated is required, given the prompt in the very article you are commenting on.

> the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

Yeah and the halting problem is hard too, but there's levels to this shit.

> also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

I would argue we don't even know the desired output for most inputs for an LLM and they certainly aren't trained on every possible input state. But I think Linux and LLMs are sufficiently different that they aren't really directly comparable like this. After all, Linux is not a pure function and has lots of side effects.

But just to establish an order of magnitude: the input space for GPT-3 was 2,048 tokens long. There were 50,257 tokens in the vocabulary. The input space thus has 50,257^(2048) unique states, which is approximately equal to 1.12 × 10^9628. That's an awfully big input space for a single function.
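
The order of magnitude can be checked with logarithms. This sketch counts only sequences of exactly 2,048 tokens, as in the estimate above; shorter inputs would add a comparatively tiny amount.

```python
import math

VOCAB_SIZE = 50_257
CONTEXT_LENGTH = 2_048

# log10(V ** N) = N * log10(V), which avoids building the huge integer itself.
log10_states = CONTEXT_LENGTH * math.log10(VOCAB_SIZE)
exponent = math.floor(log10_states)
mantissa = 10 ** (log10_states - exponent)

print(f"about {mantissa:.2f} x 10^{exponent} possible inputs")
```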

> clearly nothing ... is required

this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?

> we don't know the desired output

then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?

> linux is not a pure function ...

which is my point -- it's worse

> to establish an order of magnitude

and for linux?

There is always going to be a non-zero chance with a non-deterministic system. You can put every guard rail in place and there will always be a different way tokens are input to get bad, or subjective, tokens as output.

The findings are sick and disturbing. I hope OpenAI is not only sued for it but also that Sam Altman along with Elon, Dario and Sundar are all held accountable in front of Congress. All of these assholes have intentionally put sexual content in their models, likely including CSAM, and so if they cannot prove that it isn't part of their training data then maybe they shouldn't be able to operate as they are today.

Where is fear mongering Dario now? He loves to drag his trope around about how advanced and dangerous his models are with respect to cyber security. Yet... We never hear him say how dangerous they could be with respect to generation of CSAM! Maybe because that wouldn't help him IPO?

> non-zero

is it ever zero? is non-zero even a problem for sane usecases?

> Dario

are you saying claude reproduces CSAM from the training set? like, in ascii?

That's certainly true. The problem is, some people learn that and go "and that's okay", rather than "so they shouldn't exist and we shouldn't build them".
hopes and dreams are one hell of a drug
I don’t get it either. I think there is a reasonable expectation to try to catch these things but at the end of the day it’s figuring out some form of probabilistic outcome.
What really surprises me about this is that it sounds like they're not even trying to classify and censor generated images post-generation?

Nothing is perfect, but there are tiny classifier models that can at least mark things containing nudity and gore. That would be the bare-minimum I would expect for trying to put guardrails around an image generator.

Exactly, I think it shows failures at OpenAI to have effective classifiers. That’s the real story here.
and yet as fable demonstrated in its inability to differentiate anything physics biology or chemistry related from actual safety concerns, it’s apparently not easy to do