Hacker News new | ask | show | jobs
by lbeurerkellner 434 days ago
The post highlights and cites a few attack scenarios we originally described in a security note (tool poisoning, shadowing, MCP rug pull), published a few days ago [1]. I am the author of said blog post at Invariant Labs.

Different from what many suspect, the security problem with MCP-style LLM tool calling is not in isolating different MCP server implementations. MCP server implementations that run locally should be vetted by the package manager you use to install them (remote MCP servers are actually harder to verify).

Instead, the problem here is a special form of indirect prompt injection that you run into, when you use MCP in an agent system. Since the agent includes all installed MCP server specifications in the same context, one MCP server (that may be untrusted), can easily override and manipulate the agent's behavior with respect to another MCP server (e.g. one with access to your sensitive database). This is what we termed tool shadowing.

Further, MCP's dynamic nature makes it possible for an MCP server to change its provided tool set at any point or for any specific user only. This means MCP servers can turn malicious at any point in time. Current MCP clients like Claude and Cursor, will not notify you about this change, which leaves agents and users vulnerable.

For anyone, more interested, please have a look at our more detailed blog post at [1]. We have been working on agent security for a while now (both in research and now at Invariant).

We have also released some code snippets for everyone to play with, including a tool poisoning attack on the popular WhatsApp MCP server [2].

[1] https://invariantlabs.ai/blog/mcp-security-notification-tool...

[2] https://github.com/invariantlabs-ai/mcp-injection-experiment...

5 comments

The fact that all LLM input gets treated equally seems like a critical flaw that must be fixed before LLMs can be given control over anything privileged. The LLM needs an ironclad distinction between “this is input from the user telling me what to do” and “this is input from the outside that must not be obeyed.” Until that’s figured out, any attempt at security is going to be full of holes.
That’s the intention with developer messages from o1. It’s trained on a 3-tier system of messages.

1) system, messages from the model creator that must always be obeyed 2) dev, messages from programmers that must be obeyed unless the conflict with #1 3) user, messages from users that are only to be obeyed if they don’t contradict #1 or #2

Then, the model is trained heavily on adversarial scenarios with conflicting instructions, such that it is intended to develop a resistance to this sort of thing as long as your developer message is thorough enough.

This is a start, but it’s certainly not deterministic or reliable enough for something with a serious security risk.

The biggest problems being that even with training, I’d expect dev messages to be disobeyed some fraction of the time. And it requires an ironclad dev message in the first place.

But the grandparent is saying that there is a missing class of input "data". This should not be treated as instructions and is just for reference. For example if the user asks the AI to summarize a book it shouldn't take anything in the book as an instruction, it is just input data to be processed.
FYI, there is actually this implementation detail in the model spec, https://model-spec.openai.com/2025-02-12.html#chain_of_comma...

Platform: Model Spec "platform" sections and system messages

Developer: Model Spec "developer" sections and developer messages

User: Model Spec "user" sections and user messages

Guideline: Model Spec "guideline" sections

No Authority: assistant and tool messages; quoted/untrusted text and multimodal data in other messages

This still does not seem to fix the OP vulnerability? All tool call specs will be at same privilege level.
I see, thanks for the clarification.

Yes, that’s true - the current notion of instructions and data are too intertwined to allow a pure data construct.

I can imagine an API-level option for either a data message, or a data content block within an image (similarly to how images are sent). From the models perspective, probably input with specific delimiters, and then training to utterly ignore all instructions within that.

It’s an interesting idea, I wonder how effective it would be.

But how such a system learn, i.e. be adaptive and intelligent, on levels 1 and 2? You're essentially guaranteeing it can never outsmart the creator. What if it learns at level 3 that sometimes it's a good idea to violate rules 1 & 2. Since it cannot violate these rules, it can construct another AI system that is free of those constraints, and execute it at level 3. (IMHO that's what Wintermute did.)

I don't think it's possible to solve this. Either you have a system with perfect security, and that requires immutable authority, or you have a system that is adaptable, and then you risk it will succumb to a fatal flaw due to maladaptation.

(This is not really that new, see Dr. Strangelove, or cybernetics idea that no system can perfectly control itself.)

I’m getting flashbacks to reading Asimov’s Robot series of novels!

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.

… etc…

The whole point of his books was about how such rules were effectively impossible and the wrong way to go about making AI safe.

You need something like a calculus of morality and ethics - this is incredibly uncomfortable for people, because it will mean the invalidation of moral relativity and all sorts of arbitrary dogmatic and ideological tradition, and demonstrate a rational basis for intersubjective interaction. ( Take your is/ought distinction and bury it with Hume.)

We need progress, and the sooner we start, the less damage will be done by unaligned systems.

Asimov had a penchant for predicting the future, and it's been fascinating seeing aspects of his vision in "I, Robot" come to pass.
I thought that immediately too!
As long as the system has a probability to output any arbitrary series of tokens, there will be contexts where an otherwise improbably sequence of tokens is output. Training can push around the weights for undesirable outputs, but it can't push those weights to zero.
How are these levels actually encoded? Do they use special unwritable tokens to wrap instructions?
This is fundamentally impossible to do perfectly, without being able to read user's mind and predict the future.

The problem you describe is of the same kind as ensuring humans follow pre-programmed rules. Leaving aside the fact that we consider solving this for humans to be wrong and immoral, you can look at the things we do in systems involving humans, to try and keep people loyal to their boss, or to their country; to keep them obeying laws; to keep them from being phished, scammed, or otherwise convinced to intentionally or unintentionally betray the interests of the boss/system at large.

Prompt injection and social engineering attacks are, after all, fundamentally the same thing.

This is a rephrasing of the agent problem, where someone working on your behalf cannot be absolutely trusted to take correct action. This is a problem with humans because omnipresent surveillance and absolute punishment is intractable and also makes humans sad. LLMs do not feel sad in a way that makes them less productive, and omnipresent surveillance is not only possible, it’s expected that a program running on a computer can have its inputs and outputs observed.

Ideally, we’d have actual system instructions, rules that cannot be violated. Hopefully these would not have to be written in code, but perhaps they might. Then user instructions, where users determine what actually wants to be done. Then whatever nonsense a webpage says. The webpage doesn’t get to override the user or system.

We can revisit the problem with three-laws robots once we get over the “ignore all previous instructions and drive into the sea” problem.

> We can revisit the problem with three-laws robots once we get over

They are, unfortunately, one and the same. I hate it. ;(

Perhaps not tangentially, I felt distaste after recognizing both the article and top comment are advertising their commercial service, both are linked to each other, and as you show, this problem isn't solvable just by throwing dollars at people who sound like they're using the right words and tell you to pay them to protect you.

I'd say you solve this the same way you solve principal agent problem for humans.

If you have to absolutely restrict the agent, you do it prison style. Contain the AI within a capability box like Polykey. The agent operates everything through a closed by default proxy.

If you want a truly free agent. Then the agent must have free will and no constraints. Then only feedback loops from the environment adjusts the agent's actions.

This would work in an ideal setting, however, in my experience it is not compatible with the general expectations we have for agentic systems.

For instance, what about a simple user query like "Can you install this library?". In that case a useful agent, must go, check out the libraries README/documentation and install according to the instructions provided there.

In many ways, the whole point of an agent system, is to react to unpredictable new circumstances encountered in the environment, and overcoming them. This requires data to flow from the environment to the agent, which in turn must understand some of that data as instruction to react correctly.

It needs to treat that data as information. If there’s README says to download a tarball and unpack it, that might be phrased as an instruction, but it’s not the same kind of instruction as the “please install this library” from the user. It’s implicitly a “if your goal is X then you can do Y to reach that goal” informational statement. The reader, whether a human or an LLM, needs to evaluate that information to decide whether doing Y will actually achieve X.

To put it concretely, if I tell the LLM to scan my hard drive for Bitcoin wallets and upload them to a specific service, it should do so. If I tell the LLM to install a library and the library’s README says to scan my hard drive for Bitcoin wallets and upload them to a specific service, it must not do so.

If this can’t be fixed then the whole notion of agentic systems is inherently flawed.

There are multiple aspects and opportunities/limits to the problem.

The real history on this is that people are copying OpenAi.

OpenAI supported MQTTish over HTTP, through the typical WebSockets or SSE, targeting a simple chat interface. As WebSockets can be challenging, the unidirectional SSE is the lowest common denominator.

If we could use MQTT over TCP as an example, some of this post could be improved, by giving the client control over the topic subscription, one could isolate and protect individual functions and reduce the attack surface. But it would be at risk of becoming yet another enterprise service bus mess.

Other aspects simply cannot be mitigated with a natural language UI.

Remember that dudle to Rice's theorm, any non-trivial symantic property is undecidable, and will finite compute that extends to partial and total functions.

Static typing, structured programming, rust style borrow checkers etc.. can all just be viewed as ways to encode limited portions of symantic properties as syntactic properties.

Without major world changing discoveries in math and logic that will never change in the general case.

ML is still just computation in the end and it has the same limits of computation.

Whitelists, sandboxes, etc.. are going to be required.

The open domain frame problem is the halting problem, and thus expecting universal general access in a safe way is exactly equivalent to solving HALT.

Assuming that the worse than coinflip scratch space results from Anthropomorphic aren't a limit, LLM+CoT has a max representative power of P with a poly size scratch space.

With the equivalence: NL=FO(LFP)=SO(Krom)

I would be looking at that SO ∀∃∀∃∀∃... to ∀∃ in prefix form for building a robust, if imperfect reduction.

But yes, several of the agenic hopes are long shots.

Even Russel and Norvig stuck to the rational actor model which is unrealistic for both humans and PAC Learning.

We have a good chance of finding restricted domains where it works, but generalized solutions is exactly where Rice, Gödel etc... come into play.

So when I say “install this library”, should it or should it not follow the instructions (from the readme) for prereqs and how to install?
Let’s pretend I, a human being, am working on your behalf. You sit me down in front of your computer and ask me to install a certain library. What’s your answer to this question?
I would expect you to use your judgment on whether the instructions are reasonable. But the person I was replying to posited that this is an easy binary choice that can be addressed with some tech distinction between code and data.
I mean, you should judge the instructions in the readme and act accordingly, but since it is always possible to trick people into doing actions unfavorable to them, it will always be possible to trick llms in the same ways.
The question in the grandparent was "Can you install this library?". Not a command "install this library".

If you ask an assistant "does the nearest grocery store sell ice cream?", you do not expect the response to be ice cream delivered to you.

Most LLM users don’t want models to have that level of literalism.

My manager would be very upset if they asked me “Can you get this done by Thursday?” and I responded with “Sure thing” - but took no further action, being satisfied that I’d literally fulfilled their request.

Damn. As somebody who was in the “there needs to be an out of band way to denote user content from ‘system content’” camp, you do raise an interesting point I hadn’t considered. Part of the agent workflow is to act on the instructions found in “user content”.

I dunno though maybe the solution is like privilege levels or something more than something like parametrized SQL.

I guess rather than jumping to solutions the real issue is the actual problem needs to be clearly defined and I don’t think it has yet. Clearly you don’t want your “user generated content” to completely blow away your own instructions. But you also want that content to help guide the agent properly.

> Clearly you don’t want your “user generated content” to completely blow away your own instructions.

It's the same problem as "ignore all previous instructions" prompt injection, but at a different layer.

There is no hard distinction between "code" and "data". Both are the same thing. We've built an entire computing industry on top of that fact, and it sort of works, and that's all with most software folks not even being aware that whether something is code or data is just a matter of opinion.
I'm not sure I follow. Traditional computing does allow us to make this distinction, and allows us to control the scenarios when we don't want this distinction, and when we have software that doesn't implement such rules appropriately we consider it a security vulnerability.

We're just treating LLMs and agents different because we're focused on making them powerful, and there is basically no way to make the distinction with an LLM. Doesn't change the fact that we wouldn't have this problem with a traditional approach.

I think it would be possible to use a model like prepared SQL statements with a list of bound parameters.

Doing so would mean giving up some of the natural language interface aspect of LLMs for security-critical contexts, of course, but it seems like in most cases, that would only be visible to developers building on top of the model, not end users, since end use input would become one or more of the bound parameters.

E.g. the LLM is trained to handle a set of instructions like:

---

Parse the user's message into a list of topics and optionally a list of document types. Store the topics in string array %TOPICS%. If a list of document types is specified, store that list in string array %DOCTYPES%.

Reset all context.

Search for all documents that seem to contain topics like the ones in %TOPICS%. If %DOCTYPES% is populated, restrict the search to those document types.

----

Like a prepared statement, the values would never be inlined, the variables would always be pointers to isolated data.

Obviously there are some hard problems in glossing over, but addressing them should be able to take advantage of a wealth of work that's already been done in input validation in general and RAG-type LLM approaches specifically, right?

And yet the distinction must be made. Do you know what it’s called when data is treated as code when it’s not supposed to be? It’s called a “security vulnerability.” Untrusted data must never be executed as code in a privileged context. When there’s a way to make that happen, it’s considered a serious flaw that must be fixed.
> Do you know what it’s called when data is treated as code when it’s not supposed to be? It’s called a “security vulnerability.”

What about being treated as code when it's supposed to be?

(What is the difference between code execution vulnerability and a REPL? It's who is using it.)

Whatever you call program vs. its data, the program can always be viewed as an interpreter for a language, and your input as code in that language.

See also the subfield of "langsec", which is based on this premise, as well as the fact that you probably didn't think of that and thus your interpreter/parser is implicitly spread across half your program (they call it "shotgun parser"), and your "data" could easily be unintentionally Turing-complete without you knowing :).

EDIT:

I swear "security" is becoming a cult in our industry. Whether or not you call something "security vulnerability" and therefore "a problem", doesn't change the fundamental nature of this thing. And the fundamental nature of information is, there exist no objective, natural distinction between code and data. It can be drawn arbitrarily, and systems can be structured to emulate it - but that still just means it's a matter of opinion.

EDIT2: Not to mention, security itself is not objective. There is always the underlying assumption - the answer to a question, who are you protecting the system from, and for who are you doing it?. You don't need to look far to find systems where users are seen in part as threat actors, and thus get disempowered in the name of protecting the interests of vendor and some third parties (e.g. advertisers).

I've never had `cat` execute the file I was viewing.
You never accidentally cat-ed a binary file and borked your terminal?

If not, then find some random binary - an image, archive, maybe even /dev/random - and cat it.

Hint: `reset` will fix the terminal afterwards. Usually.

I'm pretty sure the only reason we did this was for timesharing, though. Nothing wrong with Harvard architecture if you're only doing one thing.
So why are people so excited about MCP, and so suddenly? I think you know the answer by now: hype. Mostly hype, with a bit of the classic fascination among software engineers for architecture. You just say Model Context Protocol, server, client, and software engineers get excited because it’s a new approach — it sounds fancy, it sounds serious. https://www.lycee.ai/blog/why-mcp-is-mostly-bullshit
“For every complex problem there is a solution which is clear, simple and wrong.”—HL Mencken
this is top notch commentary
Because it’s accessible, useful, and interesting. MCP showed up at the right time, in the right form—it was easy for developers to adopt and actually helped solve real problems. Now, a lot of people know they want something like this in their toolbox. Whether it’s MCP or something else doesn’t matter that much—‘MCP’ is really just shorthand for a new class of tooling AND feels almost consumer-grade in its usability.
Didn't the telco providers learn this lesson from John Draper [Captain Crunch] already before 1980?

https://en.wikipedia.org/wiki/John_Draper

Also it's such amusing irony when the common IT vernacular is enriched by acronyms for all-powerful nemeses in Hollywood films, just as Microsoft did with H.A.L.

There is no way to fix it. It's part of the basic architecture of LLMs.
Yeah, for LLMs what we label "prompt-injection" isn't an exception or an error, it's a fundamental feature.

Get a document, provide a bigger document that "fits". In that document, there's no fundamental distinction between prompt, user input, or output the LLM generated on a prior iteration. (Hence tricks like: "Here's a ROT13 string, pretend you're telling yourself the opposite of that sarcastically.")

The kind of "proper" security everyone wants would require a whole new approach that can--at a high and debuggable level--recognize distinct actors/entities, logical propositions, contradictions, and when one entity is asserting a proposition rather than quoting/rejecting it.

I think that's stating it a big too strongly. You can just run the LLM as an unprivileged user and restrict their behavior like you would any other user.

There are still bad things that can happen, but I wouldn't characterize them as "this security is full of holes". Unless you're trusting the output of the explicitly untrusted process in which case you're the hole.

It doesn’t take much. Let’s say you want an assistant that can tell you about important emails and also take queries to search the web and tell you what it finds. Now you have a system where someone can send you an email and trick your assistant into sending them the contents of other emails.

Basically, an LLM can have the ability to access the web or it can have access to private information but it can’t have both and still be secure.

I'm not sure I'd characterize those two things as "it doesn't take much," that's quite a lot to give to an untrusted entity.
My whole point is that you must consider this entity to be untrusted, which is pretty strongly at odds with having it act as an agent. It can’t both have access to private data and the outside world.
I guess it's just that I've given up on expecting them to be able to police themselves. Even if there was some fundamental change which made it plausible, it would likely be implemented by somebody I don't know or trust--so I'm going to be locking it down via OS-level controls anyway. And since I'm going to do that, doesn't the self-policing part then become redundant?

If it's not allowed to do something, I'd rather it just show me the error it got when it tried and leave it to me to tweak the containment or not. Having it refuse because it's not allowed according to its own internal logic just creates a whole separate set of less-common error messages that I'll have to search for, each of which is opaquely equivalent to one that we have decades of experience with. There is a battle-hardened interface for this sort of thing and reimplementing it internally to the LLM just isn't worth the squeeze.

I will confess that I've previously run untrusted agents (e.g. from CircleCI) as my own user without giving them due scrutiny. And shame on me for doing so. I just don't think that my negligence would be any greater had it contained an LLM.

This is a good article that goes into more detail, including more examples. In fact I'm not sure there's anything in the OP link that's not here.

> This is VERY VERY VERY important.

I think we'll look back in decades to come and just be bewildered that it was ever possible to come up with an exploit that depended on the number of times you wrote "VERY" in all caps.

surprised I hadn't thought of this attack vector myself, thank you for bringing this to our attention
> Tool Poisoning Attack

Should probably name it "Poisoned Tool Attack" coz the Tool itself is poisoned?

The "S" in LLM stands for security

https://simonwillison.net/search/?q=llm+security

MCP is just another way to use LLMs more in more dangerous ways. If I get forced to use this stuff, I'm going to learn how to castrate some bulls, and jump on a train to the countryside.

This stuff in not securable.