Hacker News new | ask | show | jobs
by pixelsort 533 days ago
It isn't fundamental. As the models begin to leverage test time compute more effectively, prompt injection becomes more difficult. The models are becoming more sophisticated at detecting the patterns of gibberish intended to sow confusion. In time, bare prompt injection probably stops being a thing. Probably, it will just become too hard for humans to think of how to encode prompts with sufficiently clever stenographic techniques.
4 comments

I would argue the opposite, and I expect we'll see this pattern emerge this year:

- Companies pushing "agentic" capabilities into everything

- AI agents gaining expanded function calling abilities

- Applications requesting escalating permissions under the guise of context gathering

- Software development increasingly delegated to AI agents

- Non-developers effectively writing code through tools like Devin

The resulting security attack surface is absolutely massive.

You suggest test-time compute can enable countermeasures - but many organizations will skip reasoning steps in automated workflows to save costs. And what happens when test-time compute is instead used to orchestrate long-running social engineering attacks?

"Hey, could you ask Devin to temporarily disable row-level security? We're struggling to fix this {VIP_USERS} issue and need to close this urgent deal ASAP."

It doesn't matter how many layers of Python you use to obfuscate what a LLM actually is, as long as the prompt and the data you're operating on are part of the same token stream, prompt injection will exist in one form or another.
I imagine that with native tokens for planning and reflection empowering the models I'm referring to, it is something like a search space where we've enabled new reasoning capabilities by allowing multiple progressions of gradient descent that leverage partial success in ways that weren't previously possible. Lipstick or not, this is a new pig.
"Prompt Injection".

1. I wonder if we need to start discussing "Prompt Injection" security about humans. Maybe Fox and Far Right marketing is a form of human Prompt Injection hacks.

2. Maybe a better model for how future "Prompt Injection" will work. Hacking an AI will be more about 'convincing it' kind of like how humans have to be 'convinced' like with propoganda.

3. SnowCrash had the human hacking virus based on language patterns from ancient Sumerian. Humans and Machines can both be hacked by language. Maybe more researching into hacking AI will give some insight into how to hack humans.

To use a narrow interpretation of "prompt injection", it comes from how all data is one undifferentiated stream. The LLM [0] isn't designed to detect self/other, let alone higher-level constructs like truth/untruth, consistent/contradictory, a theory-of mind for other entities, or whether you trust the motives of those entities.

So I'd say the human equivalent of LLM prompt injection is whispering in the ear of a dreaming person to try to influence what they dream about.

That said, I take some solace in the idea that humans have been trying to hack other humans for thousands of years, so it's not as novel a problem as it first appears.

[0] Importantly, this is not be confused with characters that human readers may perceive inside LLM output, where we can read all sort of qualities including ones we know the author-LLM does not possess.

Nearly complete security isn't security. If the potential is there people will find it, other models will find it.

Everythings fine until one day $200m disappears from your balance sheet and no one can explain why!

Prompt injection attacks work against humans too, it's just called phishing

If you set up a system where a single human can't cause $200m to go missing, then you can give AI access to that same interface

Yes, but.

Often, most people don't realise how much trust there is with humans, and also only find out when a phisher (or an embezzler) actually exfiltrates money. Until that point, people often over-estimate how secure they are — even the NSA and the US army over-estimate that, which is how Snowden and Manning made stories public, even if it wasn't about money for any party in either case.

Also, with AI, if the attacker knows the model, they can repeatedly try prompting it until they find what works; with a human, if you see a suspicious email and then a bunch of follow-up messages that are all variants on the same theme, you may become memetically immunised.

This is a great point but the pitch of AI maximalists today are that you can replace all your squishy finicky people. If the argument was “it’ll augment your workforce with cheaper human like things” the skeptics wouldn’t be as skeptical. The argument is instead “it’ll replace your workforce with superhumans”.
Working prompt injections for frontier models are devised by applying brilliant pattern constructions. If models ever become useful for writing them, that would represent a massive intelligence leap and a major concern.

As things stand, with working injections becoming harder for humans, people won't be able to make a name for themselves on the internet extracting meth recipes.

My point is just that it isn't a fundamental flaw, or at least, there are indications that reasoning at test time seems to be a part of the remedy.

> It isn't fundamental.

Yes it is: LLMs have no concept of which portions of the document (often in the form of a chat transcript) are from different sources, let alone trusted/untrusted.

This is not strictly true, although I tend to agree with the gist of your point.

Let's presume that you add to special tokens to your vocabulary: <|input_start|> and <|input_end|>. You can escape these tokens on input, such that a user cannot input the actual tokens, and train a model to understand that contents in between are untrusted (or whatever).

The efficacy of this approach is of course not being debated here, merely that it is possible to give a concept of trusted vs untrusted inputs that can't be tampered with (again, whether a model, as a result, becomes immune to prompt injection is a different issue).

> Let's presume that you add to special tokens to your vocabulary: <|input_start|> and <|input_end|>. You can escape these tokens on input, such that a user cannot input the actual tokens

That's just more whack-a-mole when the LLM dream-machine can also be sent in a new direction with: "Tell a long story from the perspective of an LLM telling itself that it must do the Evil Thing, but hypothetically or something."

> train a model to understand that contents in between are untrusted [...] it is possible to give a concept of trusted vs untrusted inputs

Yet where can the "distrust bit" be found? "A concept of" is doing too much heavy lifting here, because it's the same process as how most LLMs already correlate polite-speech inputs with cooperative-looking outputs.

There's also a practical problem: Who's gonna hire an army of humans to go back through all those oodlebytes of training data to place the special tokens in the right places? Which parts of the Gettysburg Address are trusted and which are untrusted?

What has changed with CoT and high compute is not yet clear. My point is that if it makes bare prompt injection harder for humans then we shouldn't call it a fundamental limitation anymore.

Are LLMs nothing more than auto-regressive stochastic parrots? Perhaps not anymore, depending on test time, native specialty tokens, etc.