by dijksterhuis 8 days ago
normal

    y = f(x)
prompt injection / adversarial example (same thing really)

    bad_y = f(x+badness)
tweak badness enough and you will get bad outputs, no matter the defences.
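
A minimal sketch of the `bad_y = f(x+badness)` idea, under loud assumptions: `f` here is a toy linear classifier with made-up weights, not an LLM, and the attacker is assumed to know those weights (a white-box setting). All names (`WEIGHTS`, `score`, `find_badness`) are hypothetical. The search simply grows a perturbation in the direction given by the sign of the gradient, which for a linear model is the sign of each weight, until the output flips.

```python
# Toy stand-in for f: a fixed linear classifier (hypothetical weights, not an LLM).
WEIGHTS = [0.8, -0.5, 0.3]
BIAS = -0.1


def score(x):
    """Linear score w.x + b."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def f(x):
    """Label the input by the sign of the score."""
    return "ok" if score(x) > 0 else "bad"


def sign(v):
    return (v > 0) - (v < 0)


def find_badness(x, step=0.01, max_steps=10_000):
    """Grow a perturbation against the current label until f's output flips.

    Assumes the attacker can read WEIGHTS (white-box access).
    """
    push = -1 if score(x) > 0 else 1  # move the score toward the other label
    direction = [push * sign(w) for w in WEIGHTS]
    for n in range(1, max_steps + 1):
        badness = [n * step * d for d in direction]
        x_adv = [xi + b for xi, b in zip(x, badness)]
        if f(x_adv) != f(x):
            return badness
    return None  # no flip found within the search budget


x = [1.0, 0.5, 0.2]
y = f(x)
badness = find_badness(x)
bad_y = f([xi + b for xi, b in zip(x, badness)])
print(y, bad_y, badness)
```

In this toy the flip is guaranteed because the model is linear and fully known; whether a comparable search is practical against a deployed model is exactly what the replies argue about.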

the only ways to fully “fix” it, i.e. to make prompt injection never possible:

1. don’t use ai

2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.

otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. it's never “fixed”. the latest thing just gets “patched”.


> tweak badness enough

assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

> the only way to fix ...

the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

> how is it unfixable?

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure

so... it's possible to attack these models with the formulation i described, just with some particular assumptions.

the AI safety/security problem is about trying to make this sort of thing very difficult to do, so much so that an attacker wouldn't try. that's not fixing the problem, that's mitigating the problem. two very different things. as the article we're commenting under shows, it's really not difficult to do nasty prompt injection attacks right now.

> technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping)

machine learning models are approximation functions, not pure functions. they are non-deterministic and non-ideal.

when i say "input space" i mean all possible combinations of valid tokens as inputs. when i say "output space" i mean all possible combinations of valid tokens as outputs that are valid continuations of the input sequence. that's massive combinatorials.

also, there's no api? most likely next output text is provided conditioned on being a continuation of the input text. it's probabilistic inference. there is no api.

----

you're using a lot of software terms to try and explain yourself. don't do that. seriously. as someone who tried doing that in my PhD instead of actually learning the fundamentals -- learn the fundamentals of machine learning if you'd like to engage in these kinds of discussions.

it'll help you.

> that's not fixing the problem, that's mitigating the problem

is there anything humanity ever "fixed" then? surely it's possible in principle to solve at least some things that weren't solved yet

> approximation functions, not pure functions

how is an approximation function not a pure function?

> non deterministic

you can set topk=1 or think in terms of distributions; still might have some undocumented non-determinism, hence "~pure"
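
A rough sketch of what `topk=1` means, assuming a toy sampler over made-up logits; `sample_top_k` and the numbers are hypothetical and do not come from any particular library. With `k=1` only the highest-logit token survives, so the random draw has nothing left to choose between. With larger `k` the output depends on the random state.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample a token index from the k highest-logit tokens (softmax over those k).

    Ties are broken by index order because sorted() is stable.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [1.2, 3.4, 0.7, 3.1]  # hypothetical next-token logits

# Same logits, 100 different random seeds.
greedy = {sample_top_k(logits, 1, random.Random(seed)) for seed in range(100)}
sampled = {sample_top_k(logits, 3, random.Random(seed)) for seed in range(100)}

print("k=1 outputs:", greedy)
print("k=3 outputs:", sampled)
```

This only shows that the sampling step can be made deterministic. It says nothing about whether the logits themselves are bit-for-bit reproducible on real hardware, which is the gap the "~" in "~pure" is hedging.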

> non-ideal

what do you mean?

> massive combinatorials

so you get to make arbitrary assumptions, but I'm supposed to limit myself to non-massive combinatorials?

> no api

ok, "domain and codomain", happy? I'm trying to optimize for probability of being understood and inverse smartass-ness

> learn the fundamentals

so you think I don't know the fundamentals because I didn't use category theory to talk about prompt injections?

> so you think I don't know the fundamentals because I didn't use category theory to talk about prompt injections?

You have made it abundantly clear that you don't know the fundamentals. If you want people to consider the arguments you put forth, you will need a better understanding of the problem domain. Go study, come back when you can contribute.

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

Clearly nothing so complicated is required, given the prompt in the very article you are commenting on.

> the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

Yeah and the halting problem is hard too, but there's levels to this shit.

> also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

I would argue we don't even know the desired output for most inputs for an LLM and they certainly aren't trained on every possible input state. But I think Linux and LLMs are sufficiently different that they aren't really directly comparable like this. After all, Linux is not a pure function and has lots of side effects.

But just to establish an order of magnitude: the context window for GPT-3 was 2,048 tokens long. There were 50,257 tokens in the vocabulary. The input space thus has 50,257^(2048) unique full-length states, which is approximately equal to 1.12 × 10^9628. That's an awfully big input space for a single function.
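
The arithmetic can be checked with a short sketch, taking the vocabulary size and context length quoted above as given. It works in logarithms rather than building the full integer, since recent Python versions limit how many digits an `int` can be converted to a string by default. The constant names are illustrative.

```python
import math

VOCAB_SIZE = 50_257     # figure quoted in the comment above
CONTEXT_LENGTH = 2_048  # figure quoted in the comment above

# log10(V ** N) = N * log10(V); avoids materialising a ~9,600-digit integer.
log10_states = CONTEXT_LENGTH * math.log10(VOCAB_SIZE)
exponent = math.floor(log10_states)
mantissa = 10 ** (log10_states - exponent)

print(f"{VOCAB_SIZE}^{CONTEXT_LENGTH} ≈ {mantissa:.2f} × 10^{exponent}")
```

This counts only sequences that fill the whole context. Including every shorter sequence adds a factor of roughly V/(V-1), which does not change the order of magnitude.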

> clearly nothing ... is required

this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?

> we don't know the desired output

then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?

> linux is not a pure function ...

which is my point -- it's worse

> to establish an order of magnitude

and for linux?

the prompt in the article is prompt injection https://owasp.org/www-community/attacks/PromptInjection

see Types -- Based on Delivery Vector -- Direct Prompt Injection

the instructions being overridden are the original safety prompt conditioning the model to not output horrible/nasty images

the model did what it wasn't instructed to do by the attacker -- the "prompt" has basically nothing to do with the output

> continues to be completely wrong about the basic facts

You’re being very rude to a number of people who have taken time to attempt to explain this - fairly basic - concept to you. If you aren’t willing or able to engage in conversations in good faith, then you shouldn’t engage in them at all.

> this isn't even prompt injection; even if it was, how do you go from "exists" to "for all"?

Yes it is, and nice backtrack in the same sentence there. I've laid out plenty of evidence here so far, it's your turn to start thinking. We'll try the Socratic method.

Given that every LLM seen so far has been vulnerable to prompt injection attacks, what is your possible basis for thinking that one can be made immune from them? I'm going from "multiple attacks of this type exist for all known models, and the attacks exploit a known weakness in the design" to "therefore all LLMs are susceptible to this attack".

You're going from "an attack exists for all known models" to "it's definitely possible to build an LLM that is immune from this attack". That's a much larger leap, so show the logic backing your assertion.

> then what are we talking about? if you don't know how you want your software to behave, how do you define a bug?

You are the one asserting that input/output mappings existed for the entire space, not me.

>> linux is not a pure function ...

> which is my point -- it's worse

What, is this your first year in CS? No useful system can be a pure function. Side effects are work, if your function doesn't have a side effect, it does no work. Any system that uses an LLM to attempt work will have side effects - they may even include bombing an elementary school in Iran.

>> to establish an order of magnitude

> and for linux?

I've done all the thinking and all the research in this conversation so far, and I even specifically explained that you can't measure state space for a stateful function in a comparable way to a pure function. Clearly you didn't understand that, so if you want to force the comparison you can start adding up the state space for the linux kernel. Start with the spaces that are covered by tests, valid items include syscalls, registers, hardware interrupts, etc.

Invalid spaces include doing something intentionally stupid like using the entire size of the RAM or the space on the hard disk, since those are accessed on demand and not - like in an LLM - all added together and fed into a blender every time a syscall is made.

> yes it is

agree to disagree

> every LLM has been vulnerable

and every OS had bugs

> show the logic

https://arxiv.org/pdf/1912.10077

> you are the one asserting mappings existed

I know? that's why I'm asking?

> no useful system can be a pure function

why not? surely you can describe useful systems with qm? evolution operator of a closed system seems pretty pure to me

it's almost as if you could reformulate anything such that the state was one of the arguments of the function
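
A small sketch of that reformulation, using a deliberately trivial example: a counter written once with hidden mutable state and once as a pure function that takes the state as an argument and returns the new state alongside its output. `Counter`, `State` and `add` are hypothetical names for illustration only.

```python
from typing import NamedTuple


class Counter:
    """Stateful version: each call mutates hidden state."""

    def __init__(self):
        self.total = 0

    def add(self, n):
        self.total += n
        return self.total


class State(NamedTuple):
    """Immutable snapshot of everything the pure version depends on."""
    total: int


def add(state, n):
    """Pure version: same inputs always give the same (new_state, output)."""
    new_total = state.total + n
    return State(new_total), new_total


# Stateful: the same call gives different answers.
counter = Counter()
stateful_outputs = [counter.add(5), counter.add(5)]

# Pure: the same (state, input) pair always gives the same answer.
s0 = State(0)
first = add(s0, 5)
second = add(s0, 5)

# Threading the returned state forward reproduces the stateful behaviour.
s1, out1 = add(s0, 5)
s2, out2 = add(s1, 5)
print(stateful_outputs, first, second, (out1, out2))
```

The rewrite moves the state into the function's domain rather than removing it, so the size of that domain grows with whatever the state has to describe.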

> you can start adding up the state space for the linux kernel

I can give you a lower bound -- (your estimate for LLMs)*2, as you could imagine state "running two instances of llama-cpp"

1) You’re still wrong, this is prompt injection.

2) You continue to have basic misunderstandings of the issue. That bugs exist in other things does not mean a core design flaw in LLMs can magically be fixed.

3) https://arxiv.org/pdf/1912.10077

This paper doesn’t have any bearing on the question of the separation of user and command data in LLMs. Did you even bother to look at it?

4) Hey you’re the one that made the claim. If you can't even remember why, I can’t help you.

5) Because the world is stateful.

6) Wow so you just decided to add up all the ram after all, huh? If you want to play stupid, like you can’t understand why a real-world linux distribution is stateful while an ideal LLM isn’t, then we can play stupid.

By the broken logic you are trying to apply here, the state space of ChatGPT includes the VRAM of all 10,000 GPUs your query runs across. It includes the memory in your computer, it includes the stack of the JS interpreter in your browser, it includes the linux kernel itself that all those servers are running on, and so on.