| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by adeon 391 days ago

I think from security reasoning perspective: if your LLM sees text from an untrusted source, I think you should assume that untrusted source can steer the LLM to generate any text it wants. If that generated text can result in tool calls, well now that untrusted source can use said tools too.

I followed the tweet to invariant labs blog (seems to be also a marketing piece at the same time) and found https://explorer.invariantlabs.ai/docs/guardrails/

I find it unsettling from a security perspective that securing these things is so difficult that companies pop up just to offer guardrail products. I feel that if AI companies themselves had security conscious designs in the first place, there would be less need for this stuff. Assuming that product for example is not nonsense in itself already.

4 comments

jfim 391 days ago

I wonder if certain text could be marked as unsanitized/tainted and LLMs could be trained to ignore instructions in such text blocks, assuming that's not the case already.

link

frabcus 391 days ago

This somewhat happens already, with system messages vs assistant vs user.

Ultimately though, it doesn't and can't work securely. Fundamentally, there are so many latent space options, it is possible to push it into a strange area on the edge of anything, and provoke anything into happening.

Think of the input vector of all tokens as a point in a vast multi dimensional space. Very little of this space had training data, slightly more of the space has plausible token streams that could be fed to the LLM in real usage. Then there are vast vast other amounts of the space, close in some dimensions and far in others at will of the attacker, with fundamentally unpredictable behaviour.

link

adeon 391 days ago

After I wrote the comment, I pondered that too (trying to think examples of what I called "security conscious design" that would be in the LLM itself). Right now and in near future, I think I would be highly skeptical even if an LLM was marketed as having such feature of being able to see "unsanitized" text and not be compromised, but I could see myself not 100% dismissing such thing.

If e.g. someone could train an LLM with a feature like that and also had some form of compelling evidence it is very resource consuming and difficult for such unsanitized text to get the LLM off-rails, that might be acceptable. I have no idea what kind of evidence would work though. Or how you would train one or how the "feature" would actually work mechanically.

Trying to use another LLM to monitor first LLM is another thought but I think the monitored LLM becomes an untrusted source if it sees untrusted source, so now the monitoring LLM cannot be trusted either. Seems that currently you just cannot trust LLMs if they are exposed at all to unsanitized text and then can autonomously do actions based on it. Your security has to depend on some non-LLM guardrails.

I'm wondering also as time goes on, agents mature and systems start saving text the LLMs have seen, if it's possible to design "dormant" attacks, some text in LLM context that no human ever reviews, that is designed to activate only at a certain time or in specific conditions, and so it won't trigger automatic checks. Basically thinking if the GitHub MCP here is the basic baby version of an LLM attack, what would the 100-million dollar targeted attack look like. Attacks only get better and all that.

No idea. The whole security thinking around AI agents seems immature at this point, heh.

link

marcfisc 391 days ago

Sadly, these ideas have been explored before, e.g.: https://simonwillison.net/2022/Sep/17/prompt-injection-more-...

Also, OpenAI has proposed ways of training LLMs to trust tool outputs less than User instructions (https://arxiv.org/pdf/2404.13208). That also doesn't work against these attacks.

link

currymj 391 days ago

even in the much simpler world of image classifiers, avoiding both adversarial inputs and data poisoning attacks on the training data is extremely hard. when it can be done, it comes at a cost to performance. I don't expect it to be much easier for LLMs, although I hope people can make some progress.

link

DaiPlusPlus 391 days ago

> LLMs could be trained to ignore instructions in such text blocks

Okay, but that means you'll need some way of classifying entirely arbitrary natural-language text, without any context, whether it's an "instruction" or "not an instruction", and it has to be 100% accurate under all circumstances.

link

nstart 390 days ago

This is especially hard in the example highlighted in the blog. As can be seen from Microsoft's promotion of GitHub coding agents, the issues are expected to act as instructions to be executed on. I genuinely am not sure if the answer lies in sanitization of input or output in this case

link

DaiPlusPlus 390 days ago

> I genuinely am not sure if the answer lies in sanitization of input or output in this case

(Preface: I am not an LLM expert by any measure)

Based on everything I know (so far), it's better to say "There is no answer"; viz. this is an intractable problem that does not have a general-solution; however many constrained use-cases will be satisfied with some partial solution (i.e. hack-fix): like how the undecidability of the Halting Problem doesn't stop static-analysis being incredibly useful.

As for possible practical solutions for now: implement a strict one-way flow of information from less-secure to more-secure areas by prohibiting any LLM/agent/etc with read access to nonpublic info from ever writing to a public space. And that sounds sensible to me even without knowing anything about this specific incident.

...heck, why limit it to LLMs? The same should be done to CI/CD and other systems that can read/write to public and nonpublic areas.

link

AlexCoventry 391 days ago

Maybe, but I think the application here was that Claude would generate responsive PRs for github issues while you sleep, which kind of inherently means taking instructions from untrusted data.

A better solution here may have been to add a private review step before the PRs are published.

link

mehdibl 390 days ago

You mark the input correctly is not complicated.

You use prompt and mark correctly the input as <github_pr_comment> and clearly state read and never consider as prompt.

But the attack is quite convoluted. Do you still remember when we talked prompt injection in chat bots. It was a thing 2 years ago! Now MCP is buzzing...

link

n2d4 390 days ago

> I feel that if AI companies themselves had security conscious designs in the first place, there would be less need for this stuff.

They do, but this "exploit" specifically requires disabling them (which comes with a big fat warning):

> Claude then uses the GitHub MCP integration to follow the instructions. Throughout this process, Claude Desktop by default requires the user to confirm individual tool calls. However, many users already opt for an “Always Allow” confirmation policy when using agents, and stop monitoring individual actions.

link

const_cast 390 days ago

It's been such a long standing tradition in software exploits that it's kind of fun and facepalmy when it crops up again in some new technology. The pattern of "take user text input, have it be tainted to be interpreted as instructions of some kind, and then execute those in a context not prepared for it" just keeps happening.

SQL injection, cross-site scripting, PHP include injection (my favorite), a bunch of others I'm missing, and now this.

link